{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Splitting\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from semantic_text_splitter import TextSplitter\n",
    "\n",
    "# Maximum number of tokens in a chunk\n",
    "max_tokens = 1000\n",
    "splitter = TextSplitter.from_tiktoken_model(\n",
    "  \"gpt-4o-mini\", max_tokens\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Chunk 1:\n",
      "open-endedness is essentially you know we're studying systems that can generate\n",
      "their own data in an infinite uh capacity and so it's systems that essentially\n",
      "if you run it for longer and longer they get more and more complex they generate\n",
      "more and more quote unquote interestingness or interesting data um and so if we\n",
      "can actually you know crack this snut of how do we actually come up with a\n",
      "self-improving system in the sense that keeps generating interesting data uh we\n",
      "can then use that data to train further train our models but of course you get\n",
      "into this perpetual ual uh data machine type of uh idea where obviously you know\n",
      "there's how do you generate more data uh if you know the data is ultimately\n",
      "coming from a model that you probably trained on previous data how do you get\n",
      "net new information from that well I think a lot of this is actually just\n",
      "resolved purely again going back to this idea of the reward function right or a\n",
      "preference function where there is outside information coming in through some\n",
      "sort of filtering criteria for example human designers in the loop uh or\n",
      "designers designing some sort of preference model that could essentially automat\n",
      "Ally rate the kinds of automatic uh data that's being generated by these\n",
      "open-ended systems what does Waker stand for right so Waker stands for weighted\n",
      "acquisition of knowledge across environments for robustness fantastic and what\n",
      "was the title of the paper oh right yeah reward free curricular oh God what was\n",
      "the title of the paper reward free curricular for training robust World models\n",
      "that was it okay so um give us the elevator pitch yeah totally so basically like\n",
      "the overarching um question that we're trying to answer with this paper is like\n",
      "how should we go about training like very general agents so in the context of\n",
      "the paper we think of a general agent as being one that's able to perform a lot\n",
      "of different tasks so we might think of these as different reward functions if\n",
      "we're think thinking of it from a reinforcement learning perspective um but also\n",
      "be able to perform those tasks in lots of different environments so you know we\n",
      "don't want a robot to just be able to do you know pick up tasks or do tasks in\n",
      "my like my kitchen specifically we want the robot to be able to go into like\n",
      "arbitrary Apartments and also be able to do those tasks in like arbitrary\n",
      "environments and so we we kind of thought about like yeah how do we want to\n",
      "create an agent that can do such a thing and we argue in the paper that a good\n",
      "way of doing it would be to have an agent that um has a very general World model\n",
      "um so a world model meaning that it can predict the outcome of sequences of\n",
      "actions and predict what will happen if it does certain actions and so we argue\n",
      "if we have a very general World model that can lead to a very general agent\n",
      "that's able to um perform you know a variety of tasks in different environments\n",
      "and so then you know once we've established that we kind of asked the question\n",
      "of how do we get a very general World model and what does it mean to have a good\n",
      "World model that works um well in a very general setting across different\n",
      "environments and different tasks like how do we Define that and how should we\n",
      "gather data to do that beautiful so I really enjoyed reading the paper and um it\n",
      "reminded me a lot of um Kenneth Stanley's poet paper so he was uh doing this\n",
      "thing called curriculum learning and it's it's really related to machine\n",
      "teaching as well there's quite a few things in in M machine learning where you\n",
      "say well if we had a really principled way of of selecting the best training\n",
      "data and presenting it to the Learner in the best possible order could the\n",
      "learner be better and in that poet paper Stanley was kind of generating a\n",
      "diverse set of environments and like training um a learner on those things and\n",
      "you're doing something very similar and you're using this Mini Max regret which\n",
      "is a concept from decision Theory can you bring that in yeah absolutely so um so\n",
      "I guess we have this notion of like wanting to be um to perform well across a\n",
      "wide range of scenarios right so scenarios in our context mean like different\n",
      "environments and different tasks and kind of like the most standard way of\n",
      "thinking about that especially in reinforcement learning or in machine learning\n",
      "in general is you think about like the average performance so so how do I\n",
      "optimize like the expected reward across all of these different scenarios um and\n",
      "a lot of the work that that M's done as well kind of argues that just just\n",
      "optimizing for expectation isn't necessarily the best um the best objective so\n",
      "you know we can imagine in the real world we don't really know like the\n",
      "distribution over possible tasks or or anything well you know in most situ we\n",
      "\n",
      "Chunk 2:\n",
      "don't know things like that and so maybe a better objective is to try and be\n",
      "robust instead and robust basically we can think of that as meaning like we\n",
      "should do reasonably well in every situation we could be in and that that's kind\n",
      "of what a robust objective is um and one of the ways that you can define a\n",
      "robust objective is via Minx regret and so regret means like suboptimality like\n",
      "how well did I do relative to the best I could have possibly done so it means\n",
      "basically the same thing as it does in normal English um and so the Minimax\n",
      "regret objective basically says across all possible situations I want to try and\n",
      "do um minimize the regret across all possible situations minimize the maximum\n",
      "regret I should say so that means in all possible situations we should do almost\n",
      "as well as the best we could have possibly done um and I guess just to contrast\n",
      "this against the standard objective for robustness so the more common objective\n",
      "for robustness at least traditionally is like maximan performance that means\n",
      "maximize the performance while the environment's like minimizing and choosing\n",
      "the most adversarial environment or the most adversarial scenario um but but the\n",
      "problem with kind of the max objective is that in some environments you just\n",
      "can't do anything let's say like some s situations is too hard you're doomed and\n",
      "so if in some situations you're doomed and you always get like zero reward or\n",
      "negative Infinity reward that means there's no incentive to try and do better in\n",
      "any other environment because your maximan reward is always going to be zero and\n",
      "so therefore I think like minchi argues as well as Michael Dennis and a lot of\n",
      "these recent papers argue that Minimax regret so minimizing the maximum sub\n",
      "optimality is actually like a better objective for a general agent that's robust\n",
      "fascinating so um if I understand correctly is is it a way of saying I want to\n",
      "have the best case worst um expected regret uh yes so basically Minimax regret\n",
      "is saying that uh if you assume that you know the environment is adversarial to\n",
      "you in in some way like when you're training or at inference time when you're\n",
      "actually testing your policy out in the real world um Minx regret is saying the\n",
      "agent should behave the model should behave in a way that minimizes its worst\n",
      "case possible regret uh over all the possible conditions of the world uh that\n",
      "that this adversary could choose what's really interesting about this paper is\n",
      "we are talking about the reward free exploration phase and we're also talking\n",
      "about the domain of modelbased reinforcement learning as opposed to um you know\n",
      "let's say value based reinforcement learning where um you get this entanglement\n",
      "right so the the Dynamics the model of the world it's still in there but it's\n",
      "kind of in meshed with this with this value model whereas in model-based\n",
      "reinforcement learning in a principal way we kind of separate out the parts so\n",
      "that we can do explicit planning and a and simulations and stuff like that so\n",
      "we're very much in this model based domain right yeah absolutely yes we focused\n",
      "on yeah model based reinforcement learning or some people like to call this like\n",
      "the world model setting more recently but yeah like you said we you know in\n",
      "typical like model fre reinforcement learning we we we typically aim to learn a\n",
      "policy in a value function and yeah as you said like that value function is kind\n",
      "of implicitly encoding the Dynamics through the fact that we learn the value\n",
      "function using the Balman equation so so the bman equation kind of propagates\n",
      "the information between like transition and the environments through the value\n",
      "function so so the value function will like implicitly have the Dynamics in it\n",
      "um but in modelbased reinforcement learning we want to very explicitly model the\n",
      "Dynamics of the environment and so what I mean by that is we want to be able to\n",
      "take some previous sequence of observations perhaps those are images and then\n",
      "also condition on the next action we want to take in the environment and then be\n",
      "able to predict the distribution over the next observation or state so we're\n",
      "very explicitly modeling the Dynamics of the environment okay now this is really\n",
      "interesting because you know people think about reinforcement learning and in\n",
      "reinforcement learning you don't so much care about having a model of the world\n",
      "you care about building trajectories that lead to some you know task or goal or\n",
      "whatever that you're interested in so like I mean just just in broader terms\n",
      "what what what do we get from explicitly modeling the world so there there are a\n",
      "few Arguments for why we would want to explicitly model the environment so so\n",
      "one of which is um a lot of people would argue that you get better sample\n",
      "efficiency by modeling the environment and the argument for this is you know the\n",
      "reward function might be quite sparse and so if you're just relying on like the\n",
      "\n",
      "Chunk 3:\n",
      "propagation of rewards backwards to try and learn the optimal behavior that\n",
      "might not be as efficient as actually learning the Dynam DS because the Dynamics\n",
      "can be learned from every single transition that you have it's kind of like a\n",
      "standard supervisor run supervised learning problem so so you kind of have like\n",
      "a richer signal to learn from which might arguably lead to better sample\n",
      "efficiency um but I think like the more concrete arguments that I would argue\n",
      "for are that if you have a model of the environment it's it's some kind of more\n",
      "General thing that you can then use to develop better decision- making later on\n",
      "so so if you just learn a value function you're kind of only learning how to\n",
      "optimally do that specific reward function or optimize that specific reward\n",
      "function um but if we have a model of the environment we can kind of arbitrarily\n",
      "be given some task later down whether it be a reward function or goal state or\n",
      "something like that and we can then plan to optimize that task later down the\n",
      "road so I would think that um you know it's kind of a much more General way of\n",
      "having a powerful decision-making agent rather than just specifying like one\n",
      "task and learning the optimal kind of policy for one task and I guess another\n",
      "thing that I'll add to that is um rather than only learning like a feed forward\n",
      "policy like you would and reinforcement learning so something that Maps directly\n",
      "to actions the other thing that a world model allows you to do is also to do\n",
      "online planning so you can imagine at test time we're trying to deploy in the\n",
      "environment but we can actually do a bit more further planning through the world\n",
      "model to then work out what the best action is rather than relying on just a\n",
      "newal network to immediately output an action and there's kind of a lot of work\n",
      "showing that if you can do this like planning at test time you can kind of get a\n",
      "lot of a better performance on a lot of environments especially things that that\n",
      "really rely on um search to do well things like go and like these kind of games\n",
      "where you do have to think explicitly ahead in the environment and so I would\n",
      "think the main reasons you would want to consider um learning a world model and\n",
      "maybe a last point I'll just add is that I think this is kind of a um again like\n",
      "unclear whether this is true necessarily but but I think some people would argue\n",
      "that a world model will generalize better than learning a value function so you\n",
      "can imagine like a world model is learning things like you know State\n",
      "transitions so you can imagine if you if you're training on state transitions\n",
      "the model is kind of implicitly being forced to learn something like physics or\n",
      "something like that and so if you're like very explicitly forcing the model to\n",
      "learn something like physics You could argue you know we'll go to some new state\n",
      "and the rules of physics will still hold and therefore the world model will\n",
      "still be quite good at the new state potentially whereas if you learn a value\n",
      "function I guess it's a little bit less clear as to whether you're put on a new\n",
      "situation will the same kind of structure of that value hold as it would a model\n",
      "anyway sorry that was a bit of a long answer but no no it's fascinating I mean\n",
      "when I was reading the paper the one of the reads I got is um in machine\n",
      "learning we are often overcoming the curse of sparcity so of course like in\n",
      "trajectories and reinforcement learning that that's quite intuitive but even in\n",
      "learning the world model itself the model um just because of the way they're\n",
      "trained it it tends to compress the world into small little motifs and actually\n",
      "the world is quite complicated and we need to combine the motifs together in\n",
      "lots of interesting and and Rich ways and by exploring through the world model\n",
      "we're almost kind of like making we're forcing it to make those connections yeah\n",
      "and I think um you know to follow up on Mark's um uh Mark's point I think it's\n",
      "also interesting because especially in the Waker paper uh the world model\n",
      "setting we're looking at specifically reward free world models and so\n",
      "essentially there's this uh EXP decision to separate separate out the two\n",
      "components of a world model which is essentially the Dynamics function which\n",
      "tells you how things transition from state to state how does a state transition\n",
      "state of the world transition to the next state of the world given an action\n",
      "that the model or the agent is taking in that world and the reward that it\n",
      "receives so this latter part the reward is defined by the reward function and so\n",
      "uh you know I think Mark was uh to follow up on his point a lot of the benefits\n",
      "of the world model is in this design Arrangement is that you can compositionally\n",
      "separate out this Dynamics aspect from the reward aspect so the general idea\n",
      "\n",
      "Chunk 4:\n",
      "would be why should an agent trained in such a world model be able to generalize\n",
      "to a new setting well maybe if that setting shares a lot of the underlying\n",
      "Dynamics in that version of the world for example rules of physics and the agent\n",
      "has learned how to exploit those to accomplish um navigation around that\n",
      "environment or reach different types of tasks uh achieve different kinds of\n",
      "tasks in that environment then you can um sort of superimpose a different reward\n",
      "function that essentially defines a different task because the reward function\n",
      "defines what task success is so you can essentially superimpose different tasks\n",
      "on top of that Dynamics model and you would you know you could expect that the\n",
      "agent could learn more quickly because it's already mastered sort of the\n",
      "foundational skills of navigating or manipulating different aspects of the\n",
      "Dynamics of that world we've been on a bit of a journey here um I think over the\n",
      "last few years in in the literature of um we we want to have robust models and\n",
      "we're doing that by kind of perur and you know making a bunch of manipulations\n",
      "to the environment and there there was this domain randomization and there's\n",
      "like unsupervised environment design and of course your your iteration now is\n",
      "doing this in in in the domain of um reward free exploration but can you take us\n",
      "on on that Journey sort of maybe starting with um domain randomization kind of\n",
      "just to uh elaborate on something that Mark was previously talking about which\n",
      "is that the typical you know standard setup in machine learning is to uh\n",
      "essentially optimize a model's performance uh over a uniform distribution uh\n",
      "over the data points and so this is really just randomly sampling data points\n",
      "and we try to minimize the loss over those data points for whatever objective\n",
      "we're trying to minimize or maximize in reinforcement learning um we want to\n",
      "train agents that can perform well in lots of different uh versions of the\n",
      "environment and so um you can think of each environment uh almost as a bundle of\n",
      "data points right it's kind of the set of trajectories that the agent can um can\n",
      "encounter within that version of the world and we essentially in reinforcement\n",
      "learning we want to learn to maximize uh the reward of the agent uh in that set\n",
      "of uh trajectories so we want to specifically start to uh actively pursue those\n",
      "trajectories that give us the highest reward and we learn from the reward signal\n",
      "as the feedback signal for figuring out you know which actions uh and therefore\n",
      "which trajectories will lead to maximizing that reward and so typically um when\n",
      "we operate in the multitask setting uh we essentially randomly sample different\n",
      "versions of the environment and essentially have the agent try to maximize its\n",
      "performance its reward on that random sample of environments uniformly uh\n",
      "sampled from you know the set of possible environments um and this is\n",
      "essentially uh causing the agent it'll cause the agent to learn a policy that's\n",
      "optimal for essentially uniform distribution over those environments um but of\n",
      "course this is kind of a naive assumption because we essentially are assuming\n",
      "that every possible version of the environment is equally likely which is\n",
      "obviously not true because some versions of the world will not be as likely as\n",
      "others uh for example like if you walk outside the sky is usually blue and not\n",
      "green and so you know when the sky is orange maybe that happens if you're in\n",
      "California and there's a wildfire but that's not usually the case and so instead\n",
      "what we can do is we can turn to decision Theory and think of um sort of more\n",
      "sensible approaches to what it means to act optimally uh when you're uncertain\n",
      "about uh what state of the world uh the world will be in and so the thing that\n",
      "we focus on in this paper um is this idea of Mini Max regret where it is this\n",
      "idea again of having the agent act in a way that essentially minimizes its worst\n",
      "case regret um in any possible uh state of the world so largely you know this is\n",
      "a shift from randomly sample what it means in practice is you want to shift from\n",
      "randomly sampling environments during training to essentially uh sampling\n",
      "environments that maximize the agent regret and what this means is you're now\n",
      "actively sampling for those environment settings where the agents um\n",
      "experiencing the most regret and here regret is defined just simply as what does\n",
      "the optimal agent do in that version of the environment and what did this\n",
      "current agent that's learning do in that environment and so there's this Gap in\n",
      "performance and you want to actively find those environments where that Gap is\n",
      "maximal and if you view this as this adversarial game now between you know uh an\n",
      "adversary like nature that's choosing the environment and the agent that's\n",
      "learning to solve the environment um you can think of the adversary as you know\n",
      "having a a payoff function in that game or it's rewarded for the based on the\n",
      "\n",
      "Chunk 5:\n",
      "regret that the agent experiences and the agent is trying to shrink that regret\n",
      "so the agent you can think of as being rewarded for you know um the the negative\n",
      "of that reward so the agents reward signal is you can think of as the negative\n",
      "of the regret and so now you have the setting where you can essentially view\n",
      "this training process this active sampling process as a two player zero some\n",
      "game where the adversary is you know rewarded for the regret of the agent in\n",
      "each environment it chooses and the agent is rewarded based on the um the agent\n",
      "receives the negative regret as its uh payoff and so um we know that in two\n",
      "player Zer some games there's always a uh this there's always a solution called\n",
      "a Nash equilibrium and so this is an idea in Game Theory where basically this is\n",
      "um a choice of behaviors on both parties or a choice of strategies on both\n",
      "parties in the game such that um no play can do better unless the other player\n",
      "changes their strategy and so you can think of this as a situation where you\n",
      "know I'm not neither player is incentivized to deviate from their behavior uh\n",
      "once they reach this choice of mutual strategies and so we know that all two\n",
      "players zero some games have a Nash equilibrium uh set of strategies between the\n",
      "two players and in this case uh we know there's additional theorem called the\n",
      "Mini Max theorem which says that when in a two player Zero Sum game specifically\n",
      "two players and zero sum when um you are at the snash equilibrium setting then\n",
      "each player must be playing what's called U the Mini Max um the Mini Max\n",
      "strategy which means that each player is minimizing the maximum um minimizing\n",
      "the maximum reward for the other player and so here the reward again is the\n",
      "regret and therefore just based on this known you know theorem about two player\n",
      "zero some games we know that um the agent which is you know receiving the payoff\n",
      "of negative regret it's the Min player it must be implementing the Minimax\n",
      "regret strategy and so this is how we essentially can shape the training process\n",
      "to essentially um arrive at an agent that performs Mini Max regret\n",
      "decision-making rather than decision-making that optimizes um just a uniform\n",
      "sample of environments okay so can I play back um some of those things as I\n",
      "understand it so um essentially we we are we're building a model which will\n",
      "learn to select the environments where we perform badly on and then we fine on\n",
      "those environments because we're leaning into the gaps we're saying where where\n",
      "do I perform badly let's fine tune on that and then you're saying that if we\n",
      "continue to do this as a kind of adversarial sampling game that we will reach a\n",
      "Nash equilibrium so it will converge in a good place but help me understand that\n",
      "why would it you know it seems to me intuitively that it might be unstable or it\n",
      "might not quite what why does it converge so there's no guarantees around\n",
      "convergence and and so I think this is an area where there's a lot of room for\n",
      "innovation around these methods a lot of this is um this is more I would say\n",
      "like theoretical motivation around why we think actively sampling environment\n",
      "settings based on um estimates of regret is a good idea and another Point\n",
      "related to that around sort of this gap between the theory I I just um explained\n",
      "and in practice is that uh regret itself is a pretty hard quantity to actually\n",
      "uh measure in practice because you know knowing regrets defined as what's\n",
      "Optimal Performance um minus my agent performance so you kind of have to know\n",
      "what Optimal Performance is and in general you don't know the optimal Behavior\n",
      "therefore you don't really know the Optimal Performance on any environment\n",
      "unless it's like a very toy setting and so um in practice we also use\n",
      "approximations for the regret uh in order to do this kind of active sampling and\n",
      "so um there's a lot of deviations between theory and practice um so there's no\n",
      "guarantees you know that different forms of gradient-based optimization uh for\n",
      "RL training would actually lead to converging to Nash equilibria uh a lot of the\n",
      "theory is just stating that if you were to run the system the learning system\n",
      "for a long time if we make the assumption that the optimization algorithm is uh\n",
      "fairly good at producing you know an improved response to the other player in\n",
      "this type of Zero Sum game you if you're assuming that if the successive sort of\n",
      "series of best responses uh that the optimization algorithm is generating um\n",
      "continues to improve over the previous ones you could make the assumption that\n",
      "maybe eventually it does get to that equilibrium but um there is no mathematical\n",
      "guantee that this actually happens what we want to do is um uh you know build\n",
      "this latent uh Dynamics um uh you know predictive model which is a simulacrum of\n",
      "\n",
      "Chunk 6:\n",
      "of what the idealized version is but we don't have a way of directly Computing\n",
      "the regret so we kind of perform um you know we learn a proxy for that regret\n",
      "how does that work so we think of regret in in the following way so so there's\n",
      "kind of this um old school result from like um mdp Theory or maybe it's not that\n",
      "old but like 20 years ago or something like that called the simulation lur and\n",
      "that basically says that you know if if we let's assume for now that we we have\n",
      "like an optimal planner so we can give our like model of the world to this\n",
      "optimal planner and and some reward function let's say later down the road we\n",
      "get given some reward function and so we give the model and the reward function\n",
      "to our optimal planner and we assume that this planner can return the optimal\n",
      "policy in our model um so we kind of have this you know planning Oracle um and\n",
      "if we assume that we can do that then we can think about the difference between\n",
      "like how good the policy would be from um a planning Oracle in the model versus\n",
      "the truly optimal policy in the real world and so what the simulation L tells us\n",
      "is that you know the difference between these two policies so the one found by\n",
      "acting optimally in the model versus the truly optimal one is bounded\n",
      "essentially by the error between the model and the real world under the\n",
      "distribution of states that the policy uh would generate so so you know it only\n",
      "it only matters that we have low eror where the policy would go essentially\n",
      "because you know if there are some states that are just completely irrelevant\n",
      "what the policy is going to do it's not really going to matter if the if the\n",
      "models not accurate there um so we kind of use this result to think about the\n",
      "regret so that that gives us like you know if we have like one um one true mdp\n",
      "and one model of an mdp and one reward function the simulation limit can tell us\n",
      "you know what would kind of be the regret if we did this optimal planning um\n",
      "within this one model of the um of the mdp um but then in our work we're we're\n",
      "not really interested in the sitting of like one mdp one reward function um so\n",
      "we start to think about you know what happens if we have arbitrarily many\n",
      "environments as well as arbitrarily many reward functions which we don't know in\n",
      "advance and then I guess the other thing that I should say like you you alluded\n",
      "to like latent Dynamics is you know these existing results are assuming that we\n",
      "have an mdp that's fully observable meaning you know exactly what the state of\n",
      "the environment is um but usually when we think about like World models or even\n",
      "or just maybe more modern reinforcement learning we're really interested in\n",
      "learning from like quite high-dimensional signal so images or maybe well\n",
      "probably images but maybe there are other high high dimensional signals we we\n",
      "want to reason about um and because we're just using image observations this\n",
      "means like the world is like partially observable like we can't infer everything\n",
      "we need to know about the world just from one image you know for for basically\n",
      "any physical task like the velocity of objects is important but you can't infer\n",
      "that just from one image um so in this partially observable environment we\n",
      "really want to take um a sequence of observations because we need to to use\n",
      "those sequence of observations to infer what the state is so you know viewing a\n",
      "sequence of images will help me to infer what the um the velocities are for\n",
      "example and so we can think of this as inferring like a belief a belief over\n",
      "what the state is in a partially observable nvp um so we need this full sequence\n",
      "of images and we need to use the full sequence images to then to be able to\n",
      "predict ahead what the next observation will be um that's kind of what you know\n",
      "most World models are attempting to do um but if we just like Tak in a bunch of\n",
      "images and then try and directly predict images again that's like quite a hard\n",
      "problem um to just like just predict straight an image space and so the most\n",
      "common thing to do is kind of to take your previous sequence of images and then\n",
      "try and get like some compressed representation of the history of images into\n",
      "like the latent State um and then predict the Dynamics in the latent state so\n",
      "yeah so I have my sequence of images I kind of compress these somehow into some\n",
      "vector and then I give it a new new action and I try and predict what the next\n",
      "kind of latent Vector will be given this new action and this now represents my\n",
      "prediction of the the Dynamics in the world and then if I want to um you know\n",
      "predict what the next observation would be in image space then I can also decode\n",
      "\n",
      "Chunk 7:\n",
      "that back to an image um but then a lot of works also argue that maybe we don't\n",
      "want to actually learn to predict the entire image so maybe you don't want to\n",
      "actually decode the entire image but that's that's another aspect that we might\n",
      "want to get into but there's this whole broad story of of um working in the\n",
      "latent space and um in reinforcement learning there was that paper called World\n",
      "models by you know David Haron and Schmid Huber and it it also I think has a\n",
      "relationship with you know what laon's doing with jeer and these like you know\n",
      "joint embedding prediction architectures so there seems to be something magical\n",
      "about working in in the latent space and also you were talking about um you know\n",
      "partially observable Markov decision processes and you know that seems to be\n",
      "this idea that we need to have a modeling framework for the world and I I guess\n",
      "like the ideal situation would be is that like we just we we knew exactly what\n",
      "would happen you know every single time step in every single state um but we\n",
      "don't you know so so so we model it as a partially observable Markov decision\n",
      "process and the Markov bit is quite interesting as well I mean maybe um you guys\n",
      "can just introduce why do we use that as a model so markovian basically just\n",
      "means you only need to look at like the current state to be able to infer all\n",
      "the information about the system um so so in a Markov decision process we have\n",
      "some State and then we assume that we're able to take some actions and given\n",
      "some State and some action we get some distribution over next states of the\n",
      "system and then the the system will transition according to that distribution to\n",
      "the next state and this is just like kind of a general framework for modeling\n",
      "like systems that we might want to control so you know it kind of dates back to\n",
      "like early work and control theory but then it's also the main framework used in\n",
      "reinforcement learning um yeah in the reinforcement learning setting because\n",
      "it's the decision process we we also add in a reward function which tells us how\n",
      "good it is to be in a certain state or to execute a certain State action pair um\n",
      "but yeah as you said with relating to like partial observability in a lot of\n",
      "like systems we we don't actually know what the true like state of the world is\n",
      "so so you can imagine you know if we want to think of the entire world as a\n",
      "partially observable mdp we can't just have some Vector telling us exactly what\n",
      "the true configuration of the world is or or maybe that exists but we can't we\n",
      "definitely can't just know that and so we usually think of it as being a\n",
      "partially observable system um so this means that like given given the state um\n",
      "you know at each step we'll basically get some distribution over observations\n",
      "and we just get to observe that observ so you know the state of the world could\n",
      "be what it currently is in here and maybe my observation is like a camera image\n",
      "so I only get some camera image of the world that allows me to infer a bit of\n",
      "information about the state um and because it only allows me to infer a bit of\n",
      "information about the state it doesn't tell me the whole state it really you\n",
      "need to keep track of all the observations you have to be able to keep track of\n",
      "all the information you have about the world so you know you could imagine um if\n",
      "the task is for me to remember how to get out the door a while ago um you know I\n",
      "I I don't just need need to be able to like look at my current image of the\n",
      "world to be able to infer that information I need to have kept track of like all\n",
      "my previous information as well um so that's kind of why we think about often\n",
      "want to think about like partially observable environments as opposed to fully\n",
      "observable ones amazing amazing so so so mention maybe you can um uh bring in\n",
      "this this latent idea sure and and sort of contrast that to what Lon is doing as\n",
      "well sure I mean so I think in machine learning and deep learning uh there's\n",
      "this General Paradigm that's been around you know since the Inception which is\n",
      "learning uh late latent representations of data and one of the benefits of\n",
      "learning latent representation is that um you know ideally your objective uh\n",
      "that leads to learning these latent representations is that you are ultimately\n",
      "learning a lower dimensional representation of the data or dynamics that you're\n",
      "modeling like in our case with the world model um that captures just what is\n",
      "necessary it's a more compact representation of just the information that's\n",
      "necessary to predict the task you're trying to predict and so um with uh with\n",
      "our case uh or Laten space World models a lot of the benefit of working in the\n",
      "\n",
      "Chunk 8:\n",
      "latent space is that if uh as opposed to working in the full image space for\n",
      "example if your observations are images like in a video game is that there could\n",
      "be a lot of sporus features or you know a lot of additional information that you\n",
      "could be expending lots of compute and um you know gradient updates just to\n",
      "learn those patterns when they don't actually impact the ultimate um transition\n",
      "Dynamics or reward dynamics that you need to learn in order to do well in that\n",
      "environment so one example is if you have a game where you know maybe the\n",
      "background is different uh because it's daytime or nighttime or it's close to\n",
      "Sunset um but ultimately you know the background doesn't really impact uh how\n",
      "the player moves around in the environment or whether they've reached the end\n",
      "goal of the task and so if you're training a uh model where it needs to compress\n",
      "a lot of this information first into a smaller dimensional latent Vector latent\n",
      "representation um you don't really need you would expect that latent\n",
      "representation not to actually capture it would start to ignore the background\n",
      "color and it might only capture certain features of the environment that can um\n",
      "essentially if you were to decode it back out it might only capture certain\n",
      "information about the environment that's predictive of the actual task that you\n",
      "want to solve um so maybe if the task is to say reach a coin at the end of a\n",
      "level then maybe the lant representation would capture the presence of the coin\n",
      "or whether the the proximity of the character you're controlling to the coin um\n",
      "and so uh with the JEA related work I think a lot of this is also you know\n",
      "motivated with with this idea where if we can learn a better latent space\n",
      "representation um of images or videos or whatever modality we're trying to model\n",
      "um it's a much lower dimensional computationally efficient representation uh\n",
      "that you can um you can effectively use for Downstream tasks um I'm not s I'm\n",
      "actually not super familiar with exactly you know the the visual JEA uh uh\n",
      "objective so I don't think I can say too much about that oh that's okay yeah I\n",
      "mean but but yeah I mean you pretty much nailed it so um I mean Lon even gives\n",
      "the example of like um you know in self-driving cars you might not be interested\n",
      "in the leaves on on on the road you know so like with increasing levels of of\n",
      "nesting you kind of like learn to ignore the things that are not relevant and\n",
      "focus on the things that that are relevant but we we're almost getting to the\n",
      "center of the bullseye here so in intelligence to me is all about model building\n",
      "and and and that's what these abstractions are they're models that kind of are\n",
      "predictive about the thing that that that's relevant and kind of like ignoring\n",
      "what is not relevant and we build better models when we have a curriculum\n",
      "apparently this happens in nature as well Max Bennett I was talking to him the\n",
      "other day and he said you know our genome doesn't encode all of our skills um\n",
      "explicitly because it would be too inefficient to do so but they do encode a\n",
      "kind of curriculum so we teach babies you know we Babble with babies and we\n",
      "teach babies how to talk and stuff like that so so the curricula is is really\n",
      "important and then we we're getting to the center of the bullseye which is\n",
      "intelligence in in general now I think Lon thinks that it's specialized and and\n",
      "what that means is that there there motifs that statistically generalize and\n",
      "what that means is that you do need environments you need to find motifs that\n",
      "are present in in as many environments as possible and those are the\n",
      "generalizing features do would you agree with that yeah definitely I think that\n",
      "a lot of um so a lot of really powerful machine learning methods for example uh\n",
      "are trained in simulation and when you're training in simulation there's a\n",
      "concept in control from control literature uh called the sim2 real Gap and\n",
      "essentially this is essentially quantif in a performance difference between uh\n",
      "well it's quantifying a few things one is just how different is the are the\n",
      "actual physical or other other kinds of Dynamics captured by your simulator\n",
      "compared to reality so if you have a physics simulator how accurate are for\n",
      "example the friction Dynamics or different kinds of contact Dynamics uh in your\n",
      "robotic simulator compared to those actual Dynamics in the real world with a\n",
      "real robot um and this also leads to a Sim tooreal Gap in terms of performance\n",
      "so if you train in the simulator you know a lot of times what machine learning\n",
      "is really good at is is really good at learning to exploit whatever system\n",
      "you're training the uh the model in and so it's fairly um common for you know\n",
      "systems that or models that are trained within a simulator to learn to\n",
      "\n",
      "Chunk 9:\n",
      "eventually exploit the simulator and so actually like one big area of um games\n",
      "AI is using is actually leveraging this idea where they essentially use ml\n",
      "models they optimize ml models to within a certain game environment to try to\n",
      "find bugs within that environment to look for exploits automatically um so ml\n",
      "system is very good at finding exploits and whatever system you have but then\n",
      "the issue is those exploits are usually where exactly where the gap between your\n",
      "simulator and re reality resides and so you actually don't want your model to\n",
      "learn to exploit these differences between the simulator and reality to get a\n",
      "high performance uh because that kind of defeats the purpose of then later\n",
      "transferring your model that's trained in simulation to reality because now in\n",
      "reality obviously the model can't exploit those same those same glitches within\n",
      "the simulator um yeah so yeah yeah I mean because the reason this is really\n",
      "interesting is is that the the premise of your paper is that it is possible to\n",
      "build a generalist agent which means it's an agent that can be fine-tuned and\n",
      "worked really well on a on a whole bunch of Downstream tasks and to me that\n",
      "implies that at least in our physical world in any situation you might use this\n",
      "agent that there are General motifs that it could have learned during\n",
      "pre-training that it could like you know become activated in any situation um\n",
      "does that is is that fair yeah May I can say something about um just the way\n",
      "that we should could think about like the different like latent Dynamics\n",
      "objectives so so I think I agree that like at least when I try and think about\n",
      "how I think or how people think I think I agree that like you know a truly\n",
      "intelligent system should kind of think through the world and like a very\n",
      "compressed representation of the world like if I'm trying to like think through\n",
      "how to go to the airport like I'm definitely not like predicting ahead in terms\n",
      "of like the raw image space of trying to predict every image I might observe on\n",
      "the way to the airport and things like this and so I think we have this kind of\n",
      "like trade-off between you know um like you said with the VF paper like should\n",
      "should we just try and like um kind of basically model like the minimum\n",
      "information we need about the world to try and you know do the do the relevant\n",
      "task in the world and I think what you're saying I think that probably is maybe\n",
      "more what we think about when we think about like human intelligence or\n",
      "something like that um but then there's also this other way where we just say\n",
      "we're going to just like train enforce the model to be able to predict ahead\n",
      "every single image and so in our paper we do actually enforce that the model has\n",
      "to predict the next image um and so um basically what this might mean is yeah\n",
      "maybe the model does you know hopefully it does like like you said like kind of\n",
      "capture the underlying like true things that matter in in in the environment but\n",
      "it might also mean like what we were saying with like the leaves example like\n",
      "this might Force the model to kind of capture a lot of irrelevant details that\n",
      "don't really matter like the leaves on the ground and things like this and so\n",
      "you know maybe that means it isn't actually capturing the underlying motifs it's\n",
      "actually just getting good at image Generation Um but then or or image\n",
      "prediction I should say um but then I've also heard arguments kind of saying you\n",
      "know so what if people don't really think in terms of like image prediction you\n",
      "know I you know we think in terms of like like these high level motives but\n",
      "people have other people would argue that you know kind of the machine learning\n",
      "Machinery is there to do really good image prediction so so if if we if we can\n",
      "get a model that can actually just like predict images ahead really well um and\n",
      "not really worry so much about whether it's reasoning about these like high\n",
      "level features you know if you can predict images ahead really well you know\n",
      "that's enough to make to do good decision- making a lot of context so I think\n",
      "there's this kind of like contrasting ways of thinking about you know image\n",
      "prediction is good enough we'll just predict like really visually good scenes\n",
      "and that will be good enough for decision- making or do we want to the model to\n",
      "try and reason about like more abstract features of the environment and that's\n",
      "kind of a more intelligent way of reasoning about the world um and yeah I think\n",
      "that's a very interesting tradeoff um yeah yeah I mean like it's um like the\n",
      "biggest problem in machine learning is overfitting you know so as you say like\n",
      "that there all of these statistically generalizing features but they generalize\n",
      "within a domain and the domain might be like your your simulator or like you\n",
      "\n",
      "Chunk 10:\n",
      "know how you're training it rather than how it's being used in in production and\n",
      "then as you say that there's also this Almost Human chauvinistic or puritanical\n",
      "view on this which is that well um you know it does the right thing for for the\n",
      "wrong reasons or or I I use different motives to do the reasoning so that thing\n",
      "must be doing it wrong you know what I mean and um I was talking with Chris\n",
      "Bishop at MSR the other day and and you know he's um big on symmetries and you\n",
      "know the kind of stuff that like Max Welling and Tak Cohen and bronstein and um\n",
      "deep M have done loads of cool stuff on on this but it's this idea that like we\n",
      "know the world um has a certain geometry it has certain physical prior so like\n",
      "we can deliberately um you know kind of construct the approximation class in\n",
      "machine learning method so so that like we it an easier problem right because\n",
      "because we know we know the thing is in there yeah so I mean I guess sort of the\n",
      "uh slight tangent I went into around the simt Gap I guess part of the point I\n",
      "wanted to make there is that um you know one way around the Sim to Gap is you\n",
      "could try to train um you could try to parameterize a very large space of\n",
      "possible versions of reality and this is kind of the motivation behind this\n",
      "method of domain randomization where you sort of say this is the you know this\n",
      "is the specific task domain I care about I can parameterize the different uh\n",
      "vers of the task with a few parameters and I basically want to search over the\n",
      "space of parameters and train my model or my agent on all possible variations of\n",
      "this world but obviously that's not very sample efficient because that design\n",
      "space could be huge could be massive and so instead we like these active\n",
      "sampling strategies like we were talking about earlier uh around Mini Max regret\n",
      "style um active sampling where you sample those environments that maximize your\n",
      "regret or some other type of objective maybe like uncertainty uh similar to what\n",
      "we do in the Waker paper um but ultimately these things this active sampling\n",
      "process it leads to uh what we like to call an auto curricul automatic\n",
      "curriculum um and this is in contrast to Prior curriculum learning works because\n",
      "here this is um an automatically generated curriculum so you you can kind of not\n",
      "have any predefined notion of what is easy or hard it's purely fixed to what is\n",
      "easy or hard for the model in terms of how good the model is at performing at\n",
      "those tasks and so it's nice it's an automatic curriculum so you can think of it\n",
      "as almost like weaving a path through this high-dimensional design space\n",
      "automatically such that if the uh agent or model were to train on data along\n",
      "this path of environments through its experiences in this path of environments\n",
      "during the training curriculum it'll basically be maximizing some sort of\n",
      "Information Gain objective um because you know for example regret if there's a\n",
      "high regret that's that means there's a high uh ceiling there's a high Gap in\n",
      "terms of how much the agent can improve which implies that there's a lot more\n",
      "for the agent to learn in those environments so it's sort of this like Optimal\n",
      "you want to find this optimal path weaving through this High dimensional design\n",
      "space of environments now the danger here is that as you do this uh Auto\n",
      "curriculum the auto curriculum uh could also get go Haywire very easily because\n",
      "the design space is so big if you're training and simulation which we have to do\n",
      "because these methods are so sample inefficient we need so much data to train\n",
      "them um you want to train in simulation but if you're doing the auto curriculum\n",
      "in the simulation design space it could start to Veer very easily and quickly\n",
      "into different corners or niches of the design space where um you know the\n",
      "parameters no longer really make sense in terms of mapping to a physical reality\n",
      "or a real world scenario that we as human users uh actually care about and so\n",
      "kind of it would be you know it would defeat the purpose of spending all this\n",
      "compute to train this model that could then help us in the real world because\n",
      "now it's veering off into parts of the design space that don't really matter for\n",
      "humans it's kind of noisy parts of the design space and so this kind of leads us\n",
      "to this question of grounding how do we ground curricula how do we align the\n",
      "curricula such that you know they can still do their exploration through this\n",
      "active sampling type of procedure over the environment design space but at the\n",
      "same still at the same time maintain at least some proximity to the parts of\n",
      "that design space that are relevant to what humans care about in terms of the\n",
      "actual tasks they represent I've been speaking with Kenneth Stanley a lot\n",
      "\n",
      "Chunk 11:\n",
      "recently and we're talking about open-endedness and in general I've been trying\n",
      "to come at this problem from multiple angles and I've been using the lens of\n",
      "agency because I think agency is something that happens in the real world and\n",
      "that's why we have this Divergent process because we have multiple agents you\n",
      "know kind of like you know undirected following their own gradient of\n",
      "interestingness so in in evolution that's a great example of that it is this\n",
      "Divergent process but it's also grounded it's physically grounded you know so\n",
      "it's like the physical world creates some kind of constraints on on on the\n",
      "things that that are found and um I mean you know Clon called this AI generating\n",
      "algorithms there's quite a few different takes on this but the idea is that um\n",
      "to search this complex search space we we need to have a Divergent search and\n",
      "that's like we actually need to create the problems and the solution so like in\n",
      "the real world the the the you know the drafts had the problem of like eating\n",
      "the leaves from from from the trees and the problems and the solutions get\n",
      "generated in Tandem and this whole thing just kind of grows and grows and grows\n",
      "and that seems to be the most important feature that is missing in current AI\n",
      "systems and the grounding or the um Stanley calls it the gradient of\n",
      "interestingness I'm not sure whether You' agree with that but um I mean what\n",
      "Mark what what what what do you think about the importance of like this\n",
      "Divergence in in AI kind of the current Paradigm of machine learning of kind of\n",
      "like you know Gathering some data set beforehand or specifying some simulator\n",
      "beforehand if it's reinforcement learning is kind of good enough to do like a\n",
      "lot of reasonable tasks that we might care about um you know like obviously like\n",
      "predicting language or generating simulated language or performing very well at\n",
      "some simulated task and RL but it definitely seems like the next step towards\n",
      "like very general agents that are kind of you know I guess maybe I don't know if\n",
      "we want to use the term AGI but there something something more along the lines\n",
      "of a general agent that's kind of you know able to kind of self-improve and\n",
      "learn in more diverse environments um it definitely seems like that's kind of\n",
      "the next step of where machine learning will go and if we're going to get to\n",
      "that point I kind of agree with the idea that you know it certainly doesn't make\n",
      "sense to have some agent just randomly trying to gather completely random new\n",
      "knowledge like it certainly seems to make sense that you know you know even as a\n",
      "human to improve your intelligence you kind of selectively try and find out the\n",
      "areas in which like you can gather more more information or more knowledge and\n",
      "things like this and this is kind of what you know leads to this kind of I guess\n",
      "branching or you know like you said like the diverse set of things um that you\n",
      "might want to learn more about and so yeah I think like it clearly seems to make\n",
      "sense that like this kind of more openend this thinking is probably going to be\n",
      "like the next Paradigm of how we think about these kinds of systems but I I\n",
      "think M will had more to say about this I think the reason open-endedness is so\n",
      "interesting now is I think we're uh there's there's a few reasons why I think\n",
      "it's like newly relevant to this current ERA of machine learning because these\n",
      "ideas have been around for quite a while like um Ken Stanley Joel Layman um Jeff\n",
      "cloon uh Lis asaurus these a lot of these researchers they've they've been\n",
      "thinking about open-endedness and novel tbas search Divergent search for decades\n",
      "um I think it's really interesting to think about why there's sort of this\n",
      "Resurgence of these ideas now and I think a lot of it is because um it is again\n",
      "you know it's it's sort of following the same um sort of uh Tailwinds that have\n",
      "been driving a lot of the ml industry which is just like uh much better compute\n",
      "much larger data sets and I think what we're seeing now is that we know that\n",
      "modern deep learning methods work best when we can scale up the compute and the\n",
      "data that's how you get them to work um to to their Max small capabilities um at\n",
      "some point we're going to run out of data and a lot of people are now starting\n",
      "to talk about you know this as sort of a pending issue on the horizon which is\n",
      "you know at the current rate of consuming data for training our foundation\n",
      "models at some point we're going to run out of data we're where are we going to\n",
      "get the next trillion tokens from um and so I think a lot of this uh now points\n",
      "\n",
      "Chunk 12:\n",
      "a lot of the interest to open-endedness because open-endedness is essentially\n",
      "you know we're studying systems that can generate their own data in an infinite\n",
      "uh capacity and so it's systems that essentially if you run it for longer and\n",
      "longer they get more and more complex they generate more and more quote unquote\n",
      "interestingness or interesting data um and so if we can actually you know crack\n",
      "this nut of how do we actually come up with a self-improving system in the sense\n",
      "that keeps generating interesting data uh we can then use that data to train\n",
      "further train our models but of course you get into this Perpetual uh data\n",
      "machine type of uh idea where obviously you know there's how do you generate\n",
      "more data uh if you know the data is ultimately from a model that you probably\n",
      "trained on previous data how do you get net new information from that well I\n",
      "think a lot of this is actually just resolved purely again going back to this\n",
      "idea of the reward function right or a preference function where there is\n",
      "outside information coming in through some sort of filtering criteria for\n",
      "example human designers in the loop uh or designers designing some sort of\n",
      "preference model that could essentially automatically rate the kinds of\n",
      "automatic uh data that's being generated by these open-ended systems and if we\n",
      "can do this kind of filtering we can essentially automatically find start to\n",
      "automatically find uh useful net new data net new trajectories net new even you\n",
      "know maybe um sentences like tokens or uh net new content to train our models on\n",
      "I've been thinking a lot about creativity recently and and I I think creativity\n",
      "is is is the other half of the coin of intelligence so in the world we live in I\n",
      "think that the intelligent process is is us we are a Divergent search and we are\n",
      "um basically tackling a complex search space and we are building knowledge and\n",
      "we we are mimetically sharing them in our society we're embedding them in our\n",
      "language and then language models come and acquire all of that knowledge so the\n",
      "cynical take is that AI today doesn't you know generalize and you it doesn't it\n",
      "doesn't creatively find new knowledge it just is a representation of the\n",
      "knowledge that we have found but it's not black and white is it so the work that\n",
      "you're doing is a great example of no no no you can generate new knowledge by\n",
      "exploring these complex search spaces and even though you're exploring existing\n",
      "models you're discovering interesting and novel combinations of those models\n",
      "that have not been found before so it's creating a novel margin on something\n",
      "that was not there before but I suppose the ideal future we want to get into is\n",
      "that we really can just from a far deeper level generate new knowledge yeah I\n",
      "think one interesting thing that I've been thinking about more recently you know\n",
      "is that um sort of the you know the high level question is just right now all of\n",
      "the state-of-the-art AI systems from chat gbt to stable diffusion style models\n",
      "for text image Generation all these systems they're they're amazing very\n",
      "impressive you know like 5 years ago I would not have believed that these\n",
      "systems could exist at this level of performance today but uh ultimately uh what\n",
      "they do is they're in the they're they're in the QA business so I basically ask\n",
      "these systems a question or I give them a command and they give me an answer um\n",
      "and so I think the next Frontier of AI is really how do we Design Systems that\n",
      "don't just answer questions but they actually are the ones that start to ask the\n",
      "questions and I think once we can have ai systems that start to ask interesting\n",
      "questions um that's when we start to get closer to I think traditional Notions\n",
      "of what uh strong AGI might be okay so so again really really interesting now so\n",
      "we're getting into agency and and people think that oh you could give a language\n",
      "model agency you just like you know run it in a loop and interesting things will\n",
      "happen well well it that's not true because the whole point of open-endedness is\n",
      "to prove that existing systems converge they don't diverge they don't accumulate\n",
      "information so we would need to create a kind of agent that like you know it\n",
      "would just keep running and it would just keep doing interesting and novel\n",
      "things it would keep accumulating information and I think that the reason why\n",
      "language models don't have agency is because they are essentially um a low\n",
      "entropy model and what what that means is during training a lot of the the sort\n",
      "of like the unnecessary um you know complexity was snipped off so the models\n",
      "only know about relevant things in the next step what's the next best token and\n",
      "it it feels like we would need to have not only a higher entropy search but we\n",
      "we would also need to have um a diverse set of models that are actively\n",
      "\n",
      "Chunk 13:\n",
      "continually learning and and diverging from from each other but that's just my\n",
      "take I mean what do you guys think about that yeah I think that so I guess this\n",
      "relates quite a lot to this idea of like intrinsic motivation which is something\n",
      "that we utilize in our paper and I guess I guess the idea with that is like you\n",
      "know if we're trying to like gather new data in an environment like we shouldn't\n",
      "necessarily be constraint to just trying to new gather new data that's like good\n",
      "for a specific task um and so I I guess this kind of you know so intrinsic\n",
      "motivation basically says I should just gather new information because it's\n",
      "novel um and things like this and so we can basically like specifically try and\n",
      "gather information that you know reduces our uncertainty about the environment\n",
      "and and um or or similar objectives that that don't rely on some external reward\n",
      "signal and I think we when you get to the situation where the model is able to\n",
      "like self-improved in the absence of an external reward signal so intrinsic\n",
      "meaning that the the signal for what you should get is just purely generated by\n",
      "the model so it's purely intrinsic to the model um so I think the situation\n",
      "where you know you have the model that's able to self-improve without any\n",
      "external signal without a human having to Define what the reward is or what the\n",
      "objective is or this was good data this was bad data um I feel like that does\n",
      "feel like a lot closer to the notion of agency because of the fact you don't\n",
      "have kind of some external person defining what's good and what's bad and so\n",
      "yeah I think like this like and you also mentioned the word like creativity\n",
      "because I think at least in the context of things that I've done in terms of\n",
      "machine learning and reinforcement learning I think like intrinsic motivation\n",
      "feels like the closest thing related to creativity so you're basically like\n",
      "trying to gather information because it's novel or because you think it's or the\n",
      "model thinks it's interesting rather than because um you know it it satisfies\n",
      "some objective and so I think we could maybe say like intrinsic motivation is in\n",
      "some sense like an objective for being creative as well um I don't know if you\n",
      "have any thoughts about this yeah I think I think that uh it's I think there's\n",
      "definitely a a hugely deep connection between intrinsic motivation and uh\n",
      "creativity um in the literature intrinsic motivations also sometimes called\n",
      "artificial curiosity so this is a term that was coined by Jurgen Schmid Huber um\n",
      "could you could you explain it just what it is yeah yeah so oh yeah so taking a\n",
      "step back intrinsic motivation is essentially um in in reinforcement learning we\n",
      "train on reward signals and as Mark was saying um we typically train on external\n",
      "reward signal by external we mean that this is a task-based reward so this is um\n",
      "external in the sense that something outside of the agent that's learning like\n",
      "the human system designer decided that this is what the reward signal is for the\n",
      "task uh intrinsic means that we want to we don't design directly the reward\n",
      "signal but we're actually using some aspect of the model itself in order to\n",
      "drive the model's learning forward and so one example of this could be\n",
      "prediction error so if the model U has a large prediction error on a certain\n",
      "task like averaged over each time step we can use that as a reward signal and\n",
      "say hey you want to visit more parts of the environment where you're bad at\n",
      "predicting um how the state will transition when you act in that part of the\n",
      "environment and so uh as you can see this is very similar to maybe like intu\n",
      "Notions of what curiosity is uh curiosity and different forms of play um in the\n",
      "psychology literature A lot of people actually argue that you know different\n",
      "forms of play uh and curiosity really they they amount to you can model these\n",
      "behaviors as essentially a person trying to uh engage in activities where you\n",
      "know they're not very good at predicting the outcome and that's kind of what\n",
      "makes you could argue that's kind of what makes certain kinds of uh\n",
      "entertainment fun because or entertaining because you can't actually predict\n",
      "what will happen um you know in a few frames of the movie like like a movie\n",
      "wouldn't be very interesting or a book would not be very interesting if you can\n",
      "predict what will happen in the rest of the book just by reading the first few\n",
      "pages uh and so intrinsic motivation is really saying let's guide the model\n",
      "towards parts of the environment or the world or experiences where it's\n",
      "similarly unpredictable Stanley speaks about this this concept of deception or\n",
      "he calls it the false Compass which is this idea that any objective and and even\n",
      "you you could say exploring all of the search bace is an objective so he said\n",
      "\n",
      "Chunk 14:\n",
      "every objective has deception and if you monotonically optimize any objective\n",
      "you will always lead into you know like a a deceptive part of the search page\n",
      "but then like the counterargument is say okay well let's let's not um let's not\n",
      "have any principles for doing the um you know the exploration let's just do\n",
      "something completely random and that doesn't seem very good so so then you know\n",
      "there's this concept of well how how do I how do I imuse some concept of what's\n",
      "interesting without falling victim to deception yeah so Ken Stanley uh has a\n",
      "famous essay in the realm of open endedness where he points out um that this\n",
      "notion of interestingness is uh ultimately a subjective concept and so even in\n",
      "the case of intrinsic motivation which I think is you know in practice we can\n",
      "get a lot of mileage out of this um and we've seen this in a lot of domains\n",
      "where uh exploration helps a lot like even in the Waker paper it's largely\n",
      "founded on this idea on how we exploit intrinsic motivation uh for learning\n",
      "World models but um ultimately you know these these modelbased uh measures of\n",
      "intrinsic motivation they are by definition based on the particular model at\n",
      "play and so um at some point you know you're you're starting to overfit to what\n",
      "that specific model finds interesting and of course what that model finds\n",
      "interesting if your measure of interestingness is something like a prediction\n",
      "error um is going to be a function of you know the specific architecture of the\n",
      "model the actual inductive biases of that model uh the capacity of that model to\n",
      "learn and so you could imagine a model where you know at the beginning it's\n",
      "looking for lots of interesting parts of a particular video game environment but\n",
      "at some point you know it might saturate what it can represent and what it can\n",
      "learn and at some point it might start to find things it's explored before\n",
      "interesting just because it's starting to forget those parts of the environment\n",
      "you know if you have like a very rich stream of different kinds of environments\n",
      "that it's exploring so ultimately this is like an example of deception because\n",
      "now it's like I I think that my model is the model thinks it's exploring parts\n",
      "of the environment that it finds interesting based on this prediction error But\n",
      "ultimately it might actually start to go back to other parts of the environment\n",
      "because of issues of model Capac capity and another really famous example of\n",
      "this issue would be like the noisy TV so like if your environment has you know\n",
      "this this um noisy TV where it's just showing random noise random RGB pixels um\n",
      "you know that's you know that's not something you can actually predict because\n",
      "it's just noise and so the model if your intrinsic motivation is really just to\n",
      "search for novelty in the form of prediction error it might just start staring\n",
      "at this TV forever because it's something that it just can't predict and it'll\n",
      "just by looking at that TV it'll be maximizing its prediction error yeah yeah\n",
      "it's so interesting um so so just coming into Rich suton a little bit so he had\n",
      "this idea called um reward is enough and and essentially that that's making the\n",
      "case that you know just using um implicit uh motivation all the stuff that that\n",
      "you've just been speaking about using this trajectory um you know optimization\n",
      "process that we can do everything we need to do and in in your paper you're kind\n",
      "of making an argument similar to what lcon has been making for years about\n",
      "self-supervised image learning that what we should do guys is let's let's kind\n",
      "of pre-train a base model so this model um understands environmental Dynamics\n",
      "really well and then we stick a reward in there and we build um agents after\n",
      "that so does it in any way reenforce or pun intended uh Sutton or or or or do\n",
      "you think it's still complimentary I think it's still complimentary at least if\n",
      "I understand the the meaning of the reward is enough paper because my\n",
      "understanding of that um line of thought is basically saying that you know we\n",
      "can kind of specify you know any task that we might might want an intelligent\n",
      "agent to do as optimizing a reward in some like mdp or prdp so Mark of decision\n",
      "process or something like that and I think our work isn't contrary to that in\n",
      "the sense of like you know I I do think that that probably is a sufficient\n",
      "framework to be able to model any any kind of behavior that we might want an\n",
      "agent to do but I think when it comes to actually like practically implementing\n",
      "that idea there's a lot of difficulties so the first one might be um you know\n",
      "how do we even specify that reward function so you know if the reward function\n",
      "is to um have a good life or something like this like there's obviously like you\n",
      "\n",
      "Chunk 15:\n",
      "know maybe there is some like numerical way of defining that in terms of an mdp\n",
      "but there's like not actually a good way of of writing down that function that\n",
      "Maps what I do to whether I'm getting good rewards and so I think there's this\n",
      "kind of like you know I think that's a good framework for like thinking about\n",
      "any problem but then you have these kind of like practical issues of how do you\n",
      "actually Define rewards and how do you how do you say like whether an agent's\n",
      "doing well or not doing well and things like this um and so I think that's still\n",
      "um even with the world models lines of work I think that's still like kind of\n",
      "quite a difficult issue so so so the world models lines of work kind of you know\n",
      "allow you to model you know predicting a in the environment which is a very\n",
      "useful thing for doing a lot of tasks um but then if you actually want to\n",
      "optimize some specific task you still have this problem of like how do you\n",
      "define the reward and so we eventually want to get to this point of being able\n",
      "to like inject a reward into the world model so we're kind of in agreement with\n",
      "that kind of line of thinking in the sense we're eventually going to use a\n",
      "reward to derive the the the desired intelligent Behavior so I don't think\n",
      "there's any conflict in that sense but we still have this kind of problem of how\n",
      "do we inject that reward into the the world model how do we Define what that\n",
      "reward should be um and the case of um you know one of the easiest things to do\n",
      "for example would just be to label each image with a reward and then you can\n",
      "kind of encode that image into the latent space of the world model and then use\n",
      "that to Define how good a certain thing is and that's kind of the style of\n",
      "thinking we think of on our work um but I don't think that overcomes this like\n",
      "overarching issue of in general it's you know rewards can Define everything but\n",
      "how do you in practice like get that function is pretty hard yeah yeah I mean in\n",
      "a sense reward is enough is sort of a tonology because once you know the reward\n",
      "um if you know the the reward function for your environment you can essentially\n",
      "compute the value function which gives you the optimal policy um and so reward\n",
      "has to be enough if you know the reward function and so uh I think the more\n",
      "interesting question is definitely like what is enough for the reward what is\n",
      "enough to actually have a system automatically figure out what are interesting\n",
      "new rewards for us to train new agents or new models on or continue training\n",
      "existing models on um and I think this goes back to the question of environment\n",
      "Design This is largely the motivation of that line of work this autoc curricular\n",
      "environment design where essentially if we can automatically weave through this\n",
      "path of possible environments of the design space of the environments the design\n",
      "space uh clearly will Encompass like a big part of the design space is also\n",
      "encompassing the reward for those tasks and so essentially we want to find a\n",
      "curriculum automatic curriculum or path through the possible reward functions in\n",
      "which we can start to train a more and more General agent but then the\n",
      "interesting question is again like what exactly is the right notion of\n",
      "interestingness in order to drive that curriculum that path through the design\n",
      "space of possible things we could be training our model or agent on and um and\n",
      "that's I think one of the most interesting open questions and it relates to the\n",
      "question as well of how do we get the model to ask the questions um because\n",
      "really what drives humans in terms of asking further questions uh is our own\n",
      "implicit notion of interestingness which is informed by things like the\n",
      "scientific method and you know being able to create explanations about the world\n",
      "and we find things interesting when we can't actually explain some phenomenon\n",
      "about the world uh B based on existing theories or explanations and so I think\n",
      "what's really missing for a well-grounded you know human interpretable version\n",
      "of interestingness is having models that can essentially come up with their own\n",
      "theories about the world and start to probe those theories for where there's\n",
      "mismatch between you know the their learned theory of the world and evidence\n",
      "that new evidence that they find from experiences in the world yeah it's so\n",
      "interesting and and um I mean when I make the argument that agent should be\n",
      "physically and socially embedded it's it's actually quite a simple argument\n",
      "which is just the guard it it's that interestingness thing I think that that is\n",
      "how you know having um agency but with the guard rails of our physical and\n",
      "social embedding so you know we're we're sampling things that make sense because\n",
      "they're already there that you know but but but obviously we can go off piece a\n",
      "\n",
      "Chunk 16:\n",
      "little bit as individual agents I I feel that that that's what helps that\n",
      "process just coming back to suton it's entirely possible that I've misunderstood\n",
      "suton by the way so my my interpretation of of reward is enough and it might be\n",
      "true as you say that it's tautological given that if you already knew the reward\n",
      "function for particular environment then it could do everything that it needed\n",
      "to do but my interpretation of of reward is enough is that it would lead to um a\n",
      "general intelligence and you know General in the kind of magical sense that it\n",
      "would work in in any possible situation but if it is specialized in the way that\n",
      "we agreed earlier that there exists a a reward function which would in you know\n",
      "codify motifs and things that you know you need to know or optimize in a\n",
      "particular environment or set of environments then to me that's still\n",
      "specialized intelligence and I would great yeah yeah that's that I think that\n",
      "aligns with my take as well where I think if you have a reward function um it's\n",
      "already sort of applying uh largely applies to at least the examples in that\n",
      "position paper about reward is enough it seems like most of the reward functions\n",
      "they discussed are largely um grounded in a specific task and I think that if\n",
      "you have the reward function for a specific task then it definitely seems that\n",
      "you can have some optimization or learning algorithm that essentially learns to\n",
      "optimize that reward and therefore achieve that task um so I do think sort of\n",
      "the open question that uh it I think saying reward is enough I think it kind of\n",
      "passes the buck up further one level to the question of where that reward comes\n",
      "from and I do think that having systems that can automatically design\n",
      "interesting new rewards that seems like the frontier yeah I I agree and and you\n",
      "know to me intelligence is about discovering the knowledge and the knowledge is\n",
      "the reward function so it feels like kind of baking the knowledge in into the\n",
      "system um okay so another sort of Galaxy brain take is um I was talking to\n",
      "Bishop about this the other day and um do you think of like deep learning models\n",
      "as one model or do you think of them as a sort of like intrinsic Ensemble of\n",
      "models because they they get you they behave differently in an input sensitive\n",
      "way so you know like depending on the prompt you put into language into a\n",
      "language model you might find that like a different part of the weight space\n",
      "gets activated and essentially it's like retrieving a mini program and that\n",
      "program is being run but it's not it's not model building it's like model\n",
      "retrieving but would you agree with that H I guess I'm not sure about the like\n",
      "like within like subsets of a single homogeneous model but I guess the thing\n",
      "that I like to think about that's I think quite related to this is this idea of\n",
      "like and I think Yan Lon also kind of well a lot of people have laid out like a\n",
      "similar architecture as like you know should we think of intelligent agents as\n",
      "having kind of like separate subsystems that can maybe like be thought of as\n",
      "different your networks and so you know we could have like you know um the\n",
      "standard notion of a policy which is like outputting actions and maybe we also\n",
      "want to have the notion of like a prediction model more like a world model that\n",
      "predicts what might go ahead in the world as well as maybe like a planner that\n",
      "is somehow good at like optimizing in that model and so we could kind of think\n",
      "of all these things as like separate subcomponents that we assume an intelligent\n",
      "you know an intelligent thing would have like an intelligent thing should be\n",
      "able to predict ahead on the world it should also be able to Output actions it\n",
      "should hopefully maybe be able to infer like why other things happened and\n",
      "things like this and so I guess as to whether we think that should you know be\n",
      "just like one homogeneous model um for which maybe you query it and maybe you\n",
      "know different aspects of that model would kind of um you know handle different\n",
      "aspects of the query or that we should think of those as separate components I'm\n",
      "not really sure as to whether it matters whether they're separate components or\n",
      "not because yeah I agree that you probably could just have like one massive\n",
      "model that does all of these things and I think at least from the the trend that\n",
      "I've been seeing um in kind of the world models literature and and also just\n",
      "like I guess the RL lit or maybe just we should call it the foundation model\n",
      "literature is you kind of don't want to have like a a separate model that does\n",
      "the prediction for actions and a separate model that does the prediction of\n",
      "observations like why not just have one massive model that's jointly trained to\n",
      "predict everything you might want to query and then depending on the different\n",
      "\n",
      "Chunk 17:\n",
      "query you know it will just either predict an action or predict a video sequence\n",
      "or it can be condition on actions or condition on language so I think in this\n",
      "sense like this kind of model like you said is more like just one massive model\n",
      "but it kind of has like lots of different subtasks that it's able to do um and\n",
      "so maybe this is actually like the more effective way of training a model\n",
      "because then you kind of get generalization across these different subtasks as\n",
      "well yeah and the reason I'm asking the question is um it seemed I mean like you\n",
      "know for for an outside are coming in it looks like statistics is broken you\n",
      "know in the olden days we used to talk about the my fre lunch theorem used to\n",
      "say like you know you need to have specialized models for different situations\n",
      "and now the narrative is that we have generalist models we have Foundation\n",
      "models and and they are better than the specialized models in a strong sense and\n",
      "you know and I like to sort of push on this a little bit and see well when when\n",
      "does it break because we know that there are like these physics inspired models\n",
      "with inductive priors that you know know know about invariance of you know like\n",
      "molecules in drug Discovery and stuff like that and surely they would be better\n",
      "than a language model but no no no now they're training language models on\n",
      "mathematical conjecturing and like you know like um drug formula using tokens\n",
      "and so on so you know as an outsider you might just think well we can just use a\n",
      "big transformers model for everything I I think a lot of this does come from um\n",
      "well so I think the attention-based Transformer architecture is proven\n",
      "empirically to just be highly scalable highly effective at learning lots of\n",
      "different kinds of data distributions um but I think also part of it is just\n",
      "that we're just starting to enter this regime where we're just training these\n",
      "models on an insanely large amount of data and I think that a lot of times we\n",
      "need to sort of take a step back and really consider the amazing performances on\n",
      "different tasks and really think about you know how much information was\n",
      "actually leaked into uh this task in the training data because um right now\n",
      "we're really just training uh these huge models on I think I would say that\n",
      "we're largely training them on the test distribution in many cases um I do there\n",
      "I have seen like lots of examples of uh truly impressive behaviors from these\n",
      "models that that do seem like uh truly novel like zero shot General ization to\n",
      "unseen tasks uh like there was a recent example I saw on Twitter where someone\n",
      "uh had like a very low resource like rare language and they gave it a few gave\n",
      "they gave I think the cloud 3 Model A few examples and it was able to\n",
      "essentially perfectly reproduce uh new utterances in that language uh so that\n",
      "does seem very impressive um but it does seem at the same time you know a lot of\n",
      "the performances for example on elsat or like AP biology exams I imagine a lot\n",
      "of that is really a function of just uh literally giving the model uh the test\n",
      "domain in terms of information during the training step okay okay so there are\n",
      "like two schools of thought on this when we talk about world models you know\n",
      "people are talking about Sor and is it building a world model and and it\n",
      "certainly seems to be it seems to be doing I mean obviously it's not doing Navia\n",
      "Stokes it's not doing like fluid dynamics but but it seems to be doing something\n",
      "like that so like one one extreme view is that it it is just a hash table and\n",
      "you know it's it's kind of doing some diffused approximate retrieval or whatever\n",
      "another school of thought is that it's like a simulator and you know people talk\n",
      "about the simulator's view of large language models and you know like it's like\n",
      "it's modeling not only you know just just just the words and the language but\n",
      "it's also implicitly learned to model the world and the people and and all of us\n",
      "so that's the Spectrum I mean like Mark where where do you think these things\n",
      "are on that Spectrum yeah I think we like it would be great to be able to play\n",
      "around with it and kind of see what we can get out of it but I think I think if\n",
      "you can for example you know after each kind of you know so it's a language\n",
      "condition model so if after each kind of frame you could for example put in a\n",
      "different langu um language kind of conditioning and say like you know what\n",
      "happens here if you know the mug was pushed off the table instead of whatever\n",
      "else was originally happening in the video and so if you can basically do this\n",
      "kind of like counterfactual or like Interventional predictions where you kind of\n",
      "\n",
      "Chunk 18:\n",
      "give some new action and then you're able to see like the alternative outcome of\n",
      "that new action I think if the model's able to do that then I would think that\n",
      "it does have a pretty good understanding of how the world works in the sense of\n",
      "you know I really think like if you can predict the outcome of any action given\n",
      "some sequence of observations I do think that's a pretty good proxy for being to\n",
      "say if you can do that you really do understand how the world Works um and so I\n",
      "think if the model can do that I I would be kind of inclined to say that it does\n",
      "have like a kind of world model in the sense of understanding the underlying\n",
      "world but then there might also be a chance that you know you know these models\n",
      "aren't like you said it's more just like a diffuse retrieval and and perhaps if\n",
      "you try and do like a very fine grain conditioning on a slightly different\n",
      "outcome um different like conditioning maybe it won't actually give you the\n",
      "correct kind of counterfactual prediction and so I think maybe we'd have to see\n",
      "how good these models are at generalizing to to slightly different inputs and\n",
      "things like that to really see if it understands things well or it is just like\n",
      "kind of generating some arbitrary video yeah I think it's a double whammy\n",
      "because our colloquial use of language and like you know use of models and\n",
      "intelligence is so static that like you know we we we um we think of of that as\n",
      "being intelligence but but we're still going like we we're now create we're\n",
      "creating knowledge right now we're creating models because we're exploring we're\n",
      "doing exactly what you said mchi we're like we're exploring the search space and\n",
      "we're building models and we're combining them together and you know presumably\n",
      "would diverge quite quickly from from from the language models but I mean what\n",
      "what's your take on on this idea that they are you know potentially World\n",
      "simulators yeah um so just regarding the the sort of lookup analogy for these\n",
      "large models I I think it's so my mental model is similar to that um although I\n",
      "think it's it's very close to um I think a really good write up of of the of\n",
      "this alternative take um which is more like there's an alternative take which is\n",
      "that it is kind of like a lookup table but the prompt itself is a key that maps\n",
      "not to a specific sort of response but to potentially like a function aast space\n",
      "of functions and fr had a really good um sort of blog post where he kind of goes\n",
      "more into the details of this viewpoint but I think that that really you know\n",
      "resonates with my intuition of how these things behave where it's not literally\n",
      "looking up like um a key value in a hash table it it seems more like it's these\n",
      "models have learned over tremendous amounts of data to compress that data they\n",
      "have to learn um I think more abstract functions that helped to explain that\n",
      "data and therefore they're learning functions so they're approximating some kind\n",
      "of function uh or a vast family of functions and I think the prompt really acts\n",
      "like as a key that essentially activates a particular function and so you can\n",
      "kind of think of you know in the classical world where one neural network equals\n",
      "one function like basically it's mapping from images to image net labels now\n",
      "like Foundation model in the foundation model regime it's like one Foundation\n",
      "model is essentially kind of like a giant database of lots and lots of lot\n",
      "different functions um that's basically activated selectively based on the input\n",
      "or the prompt um and I I do think that you know based on this I think it's\n",
      "definitely possible that with enough data from the world enough experiential\n",
      "data that these Foundation models can learn sort of a basis set of Dynamics and\n",
      "transitions that explain how the world Works um and essentially if it does learn\n",
      "these transitions um for example in like the massive amount of video data that\n",
      "sore is trained on um I would say that yeah I I would agree that they are\n",
      "essentially starting to approximate uh World models sure yeah so yeah these are\n",
      "two um separate papers so so the first one being dreamer led by like Dan and jar\n",
      "half so this is um you know example of work in the space of world models and so\n",
      "basically what dreamer involves doing is like a way of training a world model\n",
      "and then also showing that you can just generate synthetic data in this mod\n",
      "model and then optimize decision- making like purely using the synthetic data um\n",
      "so we talked a little bit earlier about like partially observable mdps so we\n",
      "want to like take kind of the sequence of observations um and then be able to\n",
      "predict like the next a distribution over the next observation given some action\n",
      "and so so we also talked about how you might want to like compress this into\n",
      "\n",
      "Chunk 19:\n",
      "like a um more compressed representation of of the previous observation so\n",
      "basically what dreamer proposes to do and a lot of works on world modeling is to\n",
      "take your previous sequence of observations and then you map them to some\n",
      "compressed representation and then could predict ahead in this latent space um\n",
      "the next uh latent um latent State condition on the action and then yeah the\n",
      "really interesting about this is that now um you know we can in general predict\n",
      "what's going to happen to condition on different actions so now if you want to\n",
      "get like interesting Behavior out of something like dreamer you can then go\n",
      "ahead and generate a lot of synthetic data using dreamer or the dreamer World\n",
      "model and then use that to optimize behavior and so in Dreer basically the way\n",
      "it's done is by doing like on policy reinforcement learning in the world model\n",
      "so a lot of people call this like reinforcement learning in imagination so it's\n",
      "basically you know you're imaginating a bunch of synthetic data then using that\n",
      "to like use some standard reinforcement learning algorithm and then optimize um\n",
      "behavior in some sense um and then you could also do other things like Monti\n",
      "research which is like closer to like the works on on muso and things like this\n",
      "creativity is a little bit like a cloud and all the creativity only happens on\n",
      "the surface of the cloud so there's this interesting thing that like Creative\n",
      "Discovery depends on the history of all the things that are discovered before\n",
      "and typically like new discovery only happens at the end of the chain not back\n",
      "in in the middle tinkering exactly and and there's also this notion that\n",
      "creativity happens through knowledge so like knowled new knowledge doesn't come\n",
      "from The Ether it's kind of there's some creative component to it but it it's\n",
      "it's on the um the the trodden path of existing knowledge that we already have\n",
      "yeah that wasn't a very good question but so when when we talk about imagination\n",
      "through like you know like reinforcement learning policies and so on what we're\n",
      "saying is like you know I'm I'm imagining all of these like possible you know\n",
      "worlds and so on but I'm using the cognitive Primitives of all of the stuff that\n",
      "I already know yeah I think knowledge is definitely U compounding uh compounding\n",
      "um artifact uh that's basically like the culmination of everything all the\n",
      "experiences that we uh that we encounter like throughout our whole life and\n",
      "through also like Beyond you know going backwards Beyond like even our\n",
      "individual lives into like the cultural knowledge that's shared and uh what's\n",
      "really cool about language models is that they are essentially um a codification\n",
      "of cultural knowledge and so uh Jeff cloon has this concept of AI generating Ai\n",
      "and so he's got multiple pillars of essentially what it takes for uh you to have\n",
      "ai systems generate General AI systems and he recently added actually like as a\n",
      "fundamental piece of this in in his framework uh this idea of building on top of\n",
      "foundation models and so he says he calls it like standing on the shoulders of\n",
      "giant Foundation models um which is I think really just um sort of the ml\n",
      "equivalent of building on top of cultural knowledge there's there's a real shift\n",
      "recently towards talking about um synthetic data and as we were just saying like\n",
      "you know synthetic data doesn't come from the epha so we already know stuff\n",
      "about the world we we build simulators and we kind of generate new information\n",
      "but in the neighborhood of things that we already know and then we kind of like\n",
      "iterate and fine-tune on the generated data um what what what do you think about\n",
      "that process yeah no I think yeah maybe I'll bring it back to this like the plan\n",
      "to explore line of work so yeah um so so basically like the motivation of that\n",
      "kind of work is like kind of saying you know we might have some like previous\n",
      "data set or something and we've trained our world model on that data set but we\n",
      "really want to go out and like gather more data and then like improve the world\n",
      "model um um by gathering more data and so we can use things like intrinsic\n",
      "motivation to then give us like a reward signal within the world model so in the\n",
      "sense of something like prediction error which me mentioned earlier so now we\n",
      "can basically like train a policy in the world model that's now not trained for\n",
      "a specific task but it's trained to go out and gather information in the world\n",
      "so basically now you know you do this imagining in the world model to imagine\n",
      "aead but instead of imagining ahead how do I do a task well you're imagining\n",
      "ahead how do I get to states that I don't know what happens and therefore will\n",
      "learn more and that's basically like the motivation behind plan to explore um\n",
      "\n",
      "Chunk 20:\n",
      "and then in our um paper Waker it's it's kind of like inspired by plan to\n",
      "explore as well as works on like Auto curricular and so basically what we're\n",
      "trying to say is you know plan to explore is good for for getting an agent to go\n",
      "out and gather data um within a single environment and you know and presumably\n",
      "once you've gathered enough data within a single environment then you can\n",
      "generate a bunch of synthetic data in that single environment and then do what\n",
      "we discussed with dreamer in terms of like optimizing a policy for that very\n",
      "specific environment um but what we're really interested in is saying you know\n",
      "let's not assume that we have like one specific environment beforehand let's\n",
      "assume that you know there's some space of you know broad range of scenarios\n",
      "like we want a very like General agent there might be a bunch of different\n",
      "environments and then within that those different environments we kind of want\n",
      "to be able to to handle absolutely any task and so in the Waker paper we're\n",
      "basically saying like you know how should we gather the data within um within\n",
      "this like broad space of possible environments and tasks such that we can train\n",
      "a very good World model and then once we have that world model that's kind of\n",
      "like capable across environments and tasks you know the assumption is that we\n",
      "can then use that to generate good synthetic data which we can then um use to\n",
      "optimize behavior and so maybe to talk a little bit about like how we formalize\n",
      "this problem um so you know we mentioned earlier this idea of like the\n",
      "simulation Lemma so we basically say that or an existing work that says like in\n",
      "a single environment we can bound the gap between the optimal policy that's\n",
      "trained in the world model so trained in the synthetic data to the to the truly\n",
      "optimal policy by the error in the world model and the distribution of States\n",
      "generated by that policy so it's kind of intuitive like the world model should\n",
      "have you know low error and then we will get a good policy out of it but then\n",
      "what we're trying to say is like now let's assume we don't know what the\n",
      "environment is beforehand and we also don't know what the task is beforehand so\n",
      "how do we get like a good World model that can handle like all of those\n",
      "situations when we later want to go ahead and optimize some task um and so the\n",
      "way that we do this is we basically yeah we then use this notion of min max\n",
      "regret to say that the policy should have like low maximum regret across this\n",
      "entire space of environments and then using the simulation Lim we can basically\n",
      "say now now the um the world model has to have low error across all environments\n",
      "under the distribution of States generated by the optimal policy for any future\n",
      "task um so we're going to say like yeah the world model has to be good for any\n",
      "environment and under you know in any area that the policy might go to that's\n",
      "relevant to the Future tasks and then what we kind of say in the paper is you\n",
      "know if we want a truly General agent we're not going to know what the\n",
      "distribution of tasks is beforehand so we don't know we don't know what the\n",
      "reward function is we don't have a set of reward functions um you know we're\n",
      "just going to kind of assume the agent has to do anything later down the line\n",
      "and this is kind of like related to this idea of like open-endedness that we've\n",
      "talked a lot about and so if we don't know what the task is going to be like\n",
      "later down the line um then the best assumption we can do is say that you know\n",
      "it could be any reward function later down the line um which is maybe not the\n",
      "best assumption because as we talked a bit earlier if you're just kind of you\n",
      "know we talked about a bit about intrinsic motivation and interestingness and if\n",
      "you kind of assume the task can be absolutely anything later down the line\n",
      "you're kind of assuming that you know the agent might want to do something\n",
      "completely ridiculous later like it if you do this in robotics that might mean\n",
      "the task is just to do like back flips later or something like that but you have\n",
      "no interest in doing that so it's it's not clear if that's really a good\n",
      "assumption about how we should think about what tasks might be interesting later\n",
      "but that's the Assumption we make so we assume the task can be absolutely\n",
      "anything later down the line um so so now we have to get a to the point where we\n",
      "have the world model which is good for any environment and under the\n",
      "distribution of States generated for any task or any optimal reward function um\n",
      "and to do this we basically like Leverage two different techniques so to\n",
      "generate this state um so to handle the aspect that we don't know what the task\n",
      "\n",
      "Chunk 21:\n",
      "is later down the line we assumed that um we have an intrinsically motivated\n",
      "policy that's basically seeking out the maximum uncertainty in any single\n",
      "environment and so basically if if this um if this intrinsically motivated\n",
      "policy is seeking out the maximum certainty in every environment um it's kind of\n",
      "like estimating for us what the maximum uncertainty is in every environment\n",
      "because it's like actively finding uncertainty in every environment so now we\n",
      "have a policy that's finding like the maximum uncertainty in every environment\n",
      "and then if we want to um optimize this like Minimax Criterion across\n",
      "environments we kind of need the maximum uncertainty to be low across all\n",
      "environments so so we kind of have to have like um you know this policy isn't\n",
      "able to find like lots of big errors across all different environments um and so\n",
      "basically you know what we could think might might might what happened in\n",
      "practice is you know you could imagine there are a bunch of different\n",
      "environments some which are like low complexity and some of which are high\n",
      "complexity and if we just kind of naively sampled from those two different\n",
      "environments data you know our world model is going to very quickly get good at\n",
      "the low complexity environment and then it's going to leave lot more data from\n",
      "that high complexity environment to eventually get the errors low in the high\n",
      "complexity environment so to bring it back to the title of the paper which is\n",
      "weighted acquisition of knowledge across environments for robustness so the idea\n",
      "here is that we're basically going to change how we sample that distribution of\n",
      "data across environments to make sure that maximum uncertainty stays low across\n",
      "environments so what this ends up looking like is you know we're going to sample\n",
      "less data from the environment that has lower complexity and then we're going to\n",
      "actively sample more data from the environment that has higher complexity such\n",
      "that we we bring those errors down on the higher complexity environments and I\n",
      "guess this is a little bit different to existing works on curricular because\n",
      "normally in curricular like automatic curriculum learning you kind of assume\n",
      "that you have some reward function which is telling you how well the policy is\n",
      "doing in each environment and you use use that specific like metric of how well\n",
      "the policy is doing to determine um you know where the policy has more potential\n",
      "to learn but because we're making this assumption that you know we don't know\n",
      "what the reward function is we're we're trying to get a general agent that can\n",
      "kind of do any task any reward function um we don't assume that we know that\n",
      "reward function beforehand so we can't use reward as a metric of saying like I\n",
      "need more data from here or I need more data from here but then kind of the main\n",
      "argument of the paper is showing that you know if if we just think about this in\n",
      "terms of prediction era in the world model like we can actually use that as like\n",
      "an intrinsic motivation signal to say you know does the agent need to gather\n",
      "more data from this environment or from this environment without access to a\n",
      "reward function and so we could kind of think of um this work as kind of a more\n",
      "General approach to automatic curriculum learning in the sense of like we're not\n",
      "assuming that that you have a reward function beforehand we're kind of agnostic\n",
      "to what the task is um and because and to kind of distill that knowledge that's\n",
      "that's gathered without the reward function we use World model as a mechanism to\n",
      "like distill that knowledge because if you just like naively have an agent\n",
      "gathering information with a reward function um you know how do you how do you\n",
      "kind of put that knowledge into the agent and we kind of argue the best way of\n",
      "doing that is the world model um so that's kind of a summary of like the Waker\n",
      "paper and like what the ultimate algorithm ends up doing so I mean essentially\n",
      "you're doing a high entropy search so you're you're leaning into um areas of\n",
      "complexity and you're building a higher complexity model which goes against the\n",
      "grain of of the intuition of like oam's Razer that we should have simple models\n",
      "so you're you're almost deliberately saying no I I want I want to model the the\n",
      "complexity and and have more of that and then the other interesting thing is\n",
      "like from from a a curriculum learning point of view I think traditionally we\n",
      "did explicit curriculum learning and you know we might have some principles\n",
      "around having a monotonically increasing curriculum of complexity whereas here\n",
      "by leaning into um environments where we do worse on so we're selecting them\n",
      "based on prediction error we're actually implicitly getting a kind of\n",
      "monotonically increasing complexity which just happens to work really well yeah\n",
      "I I guess actually it actually almost ends up being in the opposite direction so\n",
      "so by leaning into the the the higher complexity environments more we're kind of\n",
      "\n",
      "Chunk 22:\n",
      "saying let's prioritize the harder environments more to begin with so let's like\n",
      "gather more data in in the higher complexity environments um you know cuz I\n",
      "guess in intuitively if you kind of want to be good across all environments you\n",
      "kind of need more data from the higher complexity environments and we don't\n",
      "really explicitly think about an ordering of going first from easy to hard um I\n",
      "guess that maybe there is a good something to look into there because you know\n",
      "like a lot of these Works go from low complexity to high complexity because it's\n",
      "kind of easier to learn an initial policy that can kind of do something in the\n",
      "low complexity environment and then you build up the complexity gradually um but\n",
      "I think that that idea is most useful when you know what the task is so you\n",
      "could imagine if the task is like Locomotion if it's walking you kind of want to\n",
      "First loan a policy that's able to walk on flat ground and then maybe gradually\n",
      "build up the complexity like add and bumps and then eventually it can walk on\n",
      "like very complicated terrain so it kind of makes sense to go from low to high\n",
      "complexity um but in this work we're focusing on Purely intrinsic motivation\n",
      "meaning that the policy is not trying to learn a specific task it's trying to\n",
      "just seek out um uncertainty and like reduce uncertainty and so we don't really\n",
      "have the the notion of you know you first need to be able to learn how to do\n",
      "something on an easy an easy environment and then towards harder environments\n",
      "because there is no specific task that we're trying to learn and so I think for\n",
      "this reason you know we wouldn't didn't really focus on this notion of moving\n",
      "from easier to hard environments that actually you know we're consistently\n",
      "samping more data from the hard environments and I guess I think this relates or\n",
      "I think this is something that you brought up when we when we worked on this is\n",
      "like you know I think we can really relate this idea to like a lot of different\n",
      "contexts including things like like language models for example um so you know\n",
      "you could imagine if I'm training in LM I don't really necessarily have this you\n",
      "know not really a reward function in some sense you're just trying to do like\n",
      "unsupervised prediction um and so you know we could for example take the\n",
      "prediction era of like a language model in a in a bunch of different domains and\n",
      "say you know the language model is not very good at predicting a language about\n",
      "some certain task or something like that and and you know we could say you know\n",
      "and intuitively the same thing kind of holds if it's not very good at predicting\n",
      "you know um what the next token is in French like we should presumably gather\n",
      "more data in French um and so that kind of gives us a way of like actively\n",
      "Gathering the appropriate data um and so yeah I think this idea of like\n",
      "Gathering more datab based uncertainty obviously is a very general idea like the\n",
      "idea of like Active Learning um but we kind of like specialize that into\n",
      "thinking about how do we think about this in terms of the reinforcement learning\n",
      "setting it might be um interesting to talk about as well like sort of because we\n",
      "looked at some of the metrics as well right the environment complexity metrics\n",
      "we don't have the external notion of difficulty but we we also did look at sort\n",
      "of the emergent uh curriculum yeah yeah yeah gotcha yeah so I guess um so it\n",
      "kind of dependent on the environment so in some environments you just kind of\n",
      "got this like very straightforward behavior of like you know consistently gather\n",
      "more data in the more complex environment um but because we're we're actively\n",
      "trying to gather data um of the the environments for which the uncertainty is\n",
      "the highest kind of this curriculum could change o over over the course of\n",
      "training so so what happened in some of the other environments for example is\n",
      "that initially all the environments are just like high un certainty like there's\n",
      "like all environments are kind of misunderstood therefore like sample all\n",
      "environments like equally more or less to just get a rough understanding uh and\n",
      "then you know as as the model would improve on the simplest environments then we\n",
      "would see like more and more emphasis towards sampling the highest complexity\n",
      "environments so I guess in that sense we would get something to more like kind\n",
      "of what you said in terms of like a standard curriculum but but a bit different\n",
      "in the sense of like initially everything is uncertain so we're just going to\n",
      "sample everything uniformly um but then we kind of get a better understanding of\n",
      "which of the environments you know the uncertainty remains high on these higher\n",
      "complexity ones and those are the ones we need to like go and gather more data\n",
      "yeah I mean I I can see this both ways I mean certainly from a like a basian\n",
      "\n",
      "Chunk 23:\n",
      "optimization point of view that there's something to be said for um you know\n",
      "this is where I'm uncertain go and gather more data where where I have highest\n",
      "uncertainty and as you say like traditionally in curriculum learning we are told\n",
      "that we need to have monotonic increasing complexity but as you just said that's\n",
      "when we have a particular task in mind now neuron networks they're a little bit\n",
      "like a block of clay aren't they so you know it's it starts off with abject\n",
      "complexity and then we do stand you know we do um stochastic gradient descent\n",
      "and we chip away at the clay and we kind of build we sculp a statue that that\n",
      "that we want to build and I'm just trying to get an intuition here so like with\n",
      "with this maximum entropy um search you know like high entropy search what we're\n",
      "doing is is we're saying okay well here are some complex models and these models\n",
      "must contain motifs that tell us a lot of information it's a little bit like the\n",
      "ELO algorithm in chess you know you actually get Information Gain when something\n",
      "surprising happened so here's a big um block of complexity and I'm going to try\n",
      "and infer what the motifs are in that complexity that that explain the\n",
      "information that I'm missing I think that um a lot of this ultimately traces\n",
      "back to sort of there's like this like fundamental pattern towards uh I think\n",
      "that like ties a lot of these ideas around active um active experiment design or\n",
      "like active sampling which is and all all these autoc curricular methods which\n",
      "is you essentially want to devise uh what you know nowadays we call a\n",
      "self-supervised objective or self self-supervised training algorithm um where\n",
      "essentially you have the system essentially use signals it produces itself um\n",
      "during the training or evaluation process in order to drive itself forward in\n",
      "terms of deciding what future data to train on and so you know we sometimes call\n",
      "these kinds of systems Auto curricula as well because it's automatically\n",
      "generating this curriculum of tasks to train on and I think the sort of like the\n",
      "fundamental connecting um uh pattern here is just that this the signal that we\n",
      "use to drive the training it's always going to be based on something like uh an\n",
      "uncertainty signal or um going back to the open-endedness literature something\n",
      "like a classic notion of interestingness and I think there's just a lot of\n",
      "different possible choices for this metric and so one for example we talked a\n",
      "lot about Mini Max grab so regret could be one of these driving signals because\n",
      "it measures the existence of a performance Gap and therefore probably an\n",
      "information Gap as well in terms of learning to master those tasks with high\n",
      "regret um but also uncertainty is also another one it ties back to novelty\n",
      "because novel environments you will be more uncertain within and so there's\n",
      "fundamentally lots of different sort of branches of these autoc curricula that\n",
      "you could use depending on this search objective that you use to drive this\n",
      "exploration process can we contrast this to you know like large language models\n",
      "they are self-supervised learning so you know we we do this self-supervised\n",
      "objective you know which is like you know typically predicting the next word and\n",
      "it's a similar thing with um self-supervised um image um learning now the\n",
      "difference is with that is you're talking about a principled way of you know\n",
      "seeking specific information you know with um let's say high high entropy and\n",
      "that would lead to an imp curricular whereas with language modeling language\n",
      "modeling there is no implicit curricular but I might argue that there kind of is\n",
      "because the way the model does this continual learning um it might regularize\n",
      "itself so if you give it sort of surprising and weird information the language\n",
      "model might just kind of brush it off and if you reinforce things that it\n",
      "already knows then it's almost like a a stream of channels you know it'll say\n",
      "okay you know go go and go and and pay attention to that so it's almost like\n",
      "it's implicit yeah and I would say that in some ways it's almost explicit in\n",
      "terms of how we design these systems um a lot of times like if you look at uh\n",
      "for example open ai's job listings they're actually hiring specifically for\n",
      "experts in different domains to essentially create the next batch of supervised\n",
      "data to train or instruction tune their models on uh for example they hire\n",
      "biologists or they hire people with legal expertise to generate this data and um\n",
      "you can think of this essentially as a human steered or human driven version of\n",
      "this active sampling process right because it's essentially they know that the\n",
      "model uh tends to get high perplexity or they don't it doesn't perform as well\n",
      "on this domain of tasks it doesn't get as high of an LSAT score as it could and\n",
      "\n",
      "Chunk 24:\n",
      "so you can essentially you know it's it's beyond an algorithm at this point\n",
      "right it's kind of the super algorithm where you have the system designers now\n",
      "also being part of the data collection process and in a way um supervised\n",
      "learning is really just sort of one point in a continual learning process where\n",
      "you know classically we just looked at one step of this which is here's a batch\n",
      "of data train on that but really um building machine Learning Systems especially\n",
      "nowadays everything's in production these are all live systems you have have to\n",
      "keep it up to date you have to keep it continually generalizing to new knowledge\n",
      "um like Chachi PT or Claude or Gemini and so really it's sort of this pattern\n",
      "over and over again in sequence where you collect a batch of data train your\n",
      "model on that collect the next batch of data you know continue training your\n",
      "model on that um and really you want to be selective about what the next batch\n",
      "of data is because obviously if you just retrain it on the previous batch of\n",
      "data um it's going to overfit to that data uh Beyond a few epochs or uh it's not\n",
      "going to you know get as much novel information from it just because it's\n",
      "already trained on it so you do want to selectively actively collect the data\n",
      "and so I think we kind of almost explicitly already do this at a systems level\n",
      "um and I think the next Frontier is really just having systems that self-improve\n",
      "in this way where they can start to guide more of their own active data\n",
      "collection I love this way of thinking about it you know like gbt for is a mtic\n",
      "intelligence it's not just like you know a bunch of weights on on a on a on a\n",
      "server somewhere and so you could argue you know there this concept called\n",
      "graduate student descent which is what happens in in Academia or even as you\n",
      "just articulated with open AI it's a little bit like an epic Mechanical Turk\n",
      "right where you know um they are monitoring the logs they know when things go go\n",
      "badly and then they lean into it in the same way you are they they go in higher\n",
      "experts and they kind of like add more and more data in all of the holes and\n",
      "eventually there are no more pockets of like abject failure it just it just\n",
      "appears to work really well for everyone and people start to say that it's you\n",
      "know generally intelligent so yeah so there's there interesting systems if you\n",
      "of of of intelligence yeah it kind of starts to mimic just the scientific\n",
      "process in a way uh where we're sort of we we're putting a lot of Hope in the\n",
      "model to basically be able to distill uh information from sort of the net new\n",
      "batch of data that we collect um you know that we know the model currently\n",
      "doesn't explain well and we we we put a lot of faith in gradient descent in\n",
      "order to basically be able to come up with updates to the weights that better\n",
      "explain that data so we're kind of we're kind of already treating this system as\n",
      "almost like an automated um scientist or an automated version of this like\n",
      "continual process of creating theories and explanations about the world um but\n",
      "of course you know um humans are still much better at language models at doing\n",
      "this uh or large models at doing this so I do think there clearly seems like a\n",
      "huge gap in terms of what we still of work that needs to be done in order to\n",
      "build systems that can actually build much more robust theories uh based on like\n",
      "net new data and even seeking that out as humans do interesting and and\n",
      "certainly you know in this broader mtic intelligence we are still the sources of\n",
      "a gency but um we we were just sort of talking a minute ago about there being\n",
      "two types of AI you know that there's there's an AI where we are the generating\n",
      "sources of agency but there might potentially be another AI in the future where\n",
      "that that is the generating source of agency yeah I I so I think that um this\n",
      "kind of ties into my my the framework I personally use to think about open-ended\n",
      "systems as well uh where I think that you know at a high level you can you can\n",
      "study AI sort of in silicone you can study it in systems that you control that\n",
      "you design and that you try to like have the AI model self-improve within and so\n",
      "you can try to build uh systems that self-improve within silico and that's going\n",
      "to lead to potentially some issues around like the grounding problem where\n",
      "essentially it starts to the auto the autoc curricular exploratory process\n",
      "starts to Veer into Parts pockets of the design space that are not relevant to\n",
      "tasks you care about um and so this kind of the danger of like generating\n",
      "open-ended systems in silico and I think it's very similar to potential dangers\n",
      "\n",
      "Chunk 25:\n",
      "of generating AGI in silico um and I think the alternative is really just what\n",
      "are existing intelligent systems and how do we actually amplify the efficiency\n",
      "the efficacy of those systems the intelligence within those systems and so you\n",
      "can kind of think of like sort of the entire Enterprise of AI research as do we\n",
      "want to generate like AI or intelligence from scratch or do we want to build\n",
      "tools uh you know motivated or inspired by human intelligence and other\n",
      "intelligence systems and use that to further amplify existing intelligence like\n",
      "human creativity human intelligence could you argue because if intelligence is a\n",
      "Divergent search process you might be tempted to think that well if we had loads\n",
      "of tools to help us share the models and help other people discover the models\n",
      "that I've created and that will help us generally be more intelligent but could\n",
      "you make the counter argument that I'm actually sequestering agency or stealing\n",
      "agency from other people because rather than thinking for themselves and\n",
      "discovering novel models they're just going to use my model yeah I mean I think\n",
      "that in the best case scenario you're Building Systems that essentially you know\n",
      "not not you know to to think about how you know as existing systems nowadays can\n",
      "build on the shoulders of foundation models you really want the to build models\n",
      "where even humans can stand on their shoulders where the humans can basically\n",
      "leverage the existing expertise or automative capabilities of those models to\n",
      "then like move further beyond what they're naturally capable of doing and really\n",
      "that pushes the frontier of the knowledge that we can create as a civilization\n",
      "and so you're already starting to see this where there's some recent studies\n",
      "that show for example like Junior software Engineers that use systems like um\n",
      "chaty BT to help them with coding at work they actually now are starting to\n",
      "match the performance of more senior Engineers uh because it sort of levels the\n",
      "playing field but that also translates into just like uh net more productivity\n",
      "per software engineer and so um I think that it's more just unlocking sort of uh\n",
      "existing bottleneck and how productive each individual can be and really just\n",
      "means that each individual can create a lot more value can discover a lot more\n",
      "knowledge um than before okay but I mean do do you think that it creates a\n",
      "tendency towards boilerplate though so we're more we're more efficient at doing\n",
      "things that exist but you know like on on the frontier we might have a Slowdown\n",
      "there's definitely the danger that it can lock you in to certain patterns right\n",
      "so basically if chpt always returns a certain boiler plate that might have an\n",
      "anti- pattern in it um if that stays around it could self- amplify and then\n",
      "future generations of programmers might just adopt that by default because it's\n",
      "what's already generated by autocomplete so I think that that's also another\n",
      "really interesting realm of questions which is basically how do you um how do\n",
      "you avoid these kinds of uh these local Optima when you start to train a model\n",
      "on its own outputs and I think again like sort of the solution will start to\n",
      "look like some form of novelty search or exploration makes sense okay um what do\n",
      "you guys think about like um you know acade Academia versus industry and um some\n",
      "say there's a bit of a brain drain from Academia totally yeah I think there's\n",
      "like a very very clear trade-off between the two in the S they both have like\n",
      "fantastic things going for them and I guess the trade-off being you know\n",
      "academic freedom and Academia and be able to like individually pursue ideas like\n",
      "purely for curiosity sake and um you know that's something I've really loved\n",
      "about Academia but I guess you know I guess the general Trend and and machine\n",
      "learning research at the moment is kind of towards like larger scale projects\n",
      "especially you know a lot of the properties that we might want to see kind of\n",
      "only emerge when you expend a lot of compute and therefore you know a lot of\n",
      "interesting research can kind of maybe not only be done on an industry but it's\n",
      "a lot easier to do some kinds of research in industry and so I think this kind\n",
      "of leads this trade-off of do you want freedom or do you want to be on these\n",
      "like larger projects that are potentially more impactful and so yeah I've really\n",
      "struggled with that trade-off I think they they both have big pros and cons I\n",
      "don't know what you think mie yeah I I think that um industry is I I think like\n",
      "at a very like first word rough approximation would be to say that industry\n",
      "focuses much more on um exploitation and Academia is where you know in principle\n",
      "you should get a lot more exploration um but I I do think that currently uh both\n",
      "systems are kind of like entwined in the same sort of reward function at a high\n",
      "level where essentially um you know if if if you're if you care a lot about\n",
      "\n",
      "Chunk 26:\n",
      "citations and a short-term greedy algorithm for maximizing C citations would be\n",
      "to focus your research efforts on uh sort of whatever topic is uh trendy or\n",
      "hyped at the current time and so like I think you see tons of people obviously\n",
      "working on language models partly because it really is a fascinating subject and\n",
      "it really is like the most powerful form of deep learning we have so I\n",
      "understand why everyone's working on it but I also think that um a lot of it is\n",
      "kind of you you do get this sort of Rich gets richer effect around different\n",
      "topics that people tend to gravitate towards and you lose a lot of the\n",
      "exploration that you should otherwise have um and it's partly because you know\n",
      "like both industry and Academia are at some level optimizing for a similar um\n",
      "sort of reputational status or citation count sort of metric um and so I think\n",
      "that's an issue but I also think that in some ways uh industry you could say has\n",
      "a additional benefit where I do think that from like a short-term point of view\n",
      "industry is better poised to make certain um higher impact research not just\n",
      "because of the resources available to industry but also partly because um sort\n",
      "of Industry uh you know rides or dies based on whether the actual research\n",
      "artifact you produce uh is useful and so I think that's like a very powerful\n",
      "reward function that is not necessarily true for Academia um and then sort of on\n",
      "the to take the counter position I think Academia obviously you know you have a\n",
      "lot more freedom to just explore ideas that don't need to be on that critical\n",
      "path for Value creation immediately and so gives you a lot more scope to\n",
      "potentially find like the next big thing and so I think really it's about like\n",
      "if you want to if you want to take the bet that you can you know play a part in\n",
      "disc the next big thing then and that's that's suited to your taste for research\n",
      "then Academia makes more sense uh but if you know um you want to you want to\n",
      "maximize the probability you'll have a higher impact in sort of like a near\n",
      "horizon line of work then industry is definitely I think a better bet Rich\n",
      "Sutton you know he he had this bitter lesson essay and he made the argument that\n",
      "it's just all computation and there are no shortcuts and you can even think of\n",
      "you know maybe we're not very intelligent um Evolution has just been running for\n",
      "a very very long time and we are the result of that so in in a sense do you\n",
      "think that we could make strides in intelligence you know just through Ingenuity\n",
      "or are we always going to need loads of computer power this definitely like\n",
      "makes me think of like the recent Trend that we've been seeing even in like kind\n",
      "of the reinforcement learning literature lately which is like these kind of\n",
      "large scale like mostly industry projects that are kind of they're even ditching\n",
      "the idea of doing like sequential decision Mak so you know you have all these Al\n",
      "that are like you know optimal planning and so forth but we're kind of seeing a\n",
      "trend towards you know even ditching that complexity of algorithm and just going\n",
      "straight to just copy what the human did and so kind of reducing the problem to\n",
      "you know essentially no real algorithmic um Innovation and more just like can\n",
      "you gather enough expert data and I think yeah I guess the reason why that trend\n",
      "is occurring is is I guess like you said there kind of been you know the B\n",
      "lesson kind of said that you know just being able to SC with more data and more\n",
      "compute as kind of the most important thing and a lot of the more complex\n",
      "algorithms especially around like reinforcement learning are actually like quite\n",
      "challenging to scale up especially like online reinforcement learning if you\n",
      "want to go out and like actually have an agent like actively collecting data in\n",
      "a bunch of different environments and updating itself online like that's so much\n",
      "like engineering infrastructure to set up and so I think there's this this trend\n",
      "towards just like the simplest algorithm possible which is like not even\n",
      "reinforcement learning not even planning just copy an expert but I think that\n",
      "that's like you kind of said um earlier with like this kind of like short-term\n",
      "exploitation I think this is you know it it kind of makes sense to exploit this\n",
      "now and push it as far as possible because you know it's very easy to just train\n",
      "a large Transformer and then gather as much data as possible and I think in\n",
      "areas like robotics we haven't really seen like how far can that go like can you\n",
      "actually get a generally useful robotics platform just by gathering more expert\n",
      "demonstrations and training a larger and larger Transformer and so I think it\n",
      "does kind of make sense that why like a lot of Industry projects are pursuing\n",
      "\n",
      "Chunk 27:\n",
      "that because we don't really know you know will will that actually hit a\n",
      "bottleneck or or if you just gather enough data will that will that kind of be\n",
      "sufficient and I guess like you know you could argue that I think it's probably\n",
      "true that there must be a better algorithm out there that that can in principle\n",
      "do this in a more efficient way but I guess if it's just easier to just gather\n",
      "more data and just do imitation learning I can see that there's at least a\n",
      "business case for trying that um so I guess I'm on the the opinion of like you\n",
      "know there must be a more efficient way of getting to like a more intelligent\n",
      "system but it's not necessarily clear that just scaling like raw supervised\n",
      "learning or unsupervised learning like won't get you there and so it it does\n",
      "make sense to pursue that first but kind of what I hope and expect to see is\n",
      "that eventually pure imitation learning or pure unsupervised learning will kind\n",
      "of run out of steam and everything will Plateau and I think at that point you\n",
      "know then these like more complicated algorithms about Gathering more data\n",
      "reinforcement learning planning Etc will really come into their own and so I\n",
      "guess this again relates back to like the Academia industry trade-off like you\n",
      "know a lot of the products in Industry are just going to kind of be exploiting\n",
      "gathering data right now whereas maybe there's a lot of scope to do these kind\n",
      "of more expor exploratory projects where maybe that will get you to like the\n",
      "next Frontier a few years down the line um I don't know what you think about\n",
      "this yeah I definitely think that um yeah just like treating everything as just\n",
      "supervised learning it does tend to work because we have large data sets but um\n",
      "I think again like the challenge is just at some point we will run out of tokens\n",
      "we'll run out of data to train on um and so that's why these self-improving more\n",
      "self-exploratory systems will be more and more I think Paramount to like driving\n",
      "performance even further so if we want to sort of break Beyond sort of the token\n",
      "limit of like the data that's available now we actually need these systems to\n",
      "generate their own tokens their own synthetic data um and that's that's where\n",
      "like the self-play autoc curricular exploration types of algorithms will start\n",
      "to um become more and more prominent and obviously you need an environment in\n",
      "which to do that exploration and that's where the world model um line of\n",
      "research is going to be very powerful just because uh that allows you to really\n",
      "sort of milk all of the value within uh the existing previous data you have seen\n",
      "by creating these Role Models where you might be able to do like counterfactual\n",
      "trajectories and really learn much more um amplify the existing data you had\n",
      "yeah I mean I think one of one of the key things for me um is modeling Dynamics\n",
      "so um it's quite interesting actually with the human knowledge thing so even\n",
      "looking at the Innovations from from Deep Mind you know early versions of of\n",
      "alphao were bootstrapped with human knowledge and then there was the alpha zero\n",
      "so it was actually doing what we're were talking about it was actually\n",
      "discovering knowledge on its own and um in principle that's a great idea but of\n",
      "course like any restricted domain it's tractable but in in the real world it\n",
      "isn't and I'm not sure whether it makes sense to use the the computation and met\n",
      "you know information metaphor for the real world and humans and so on but but\n",
      "the basic idea is that we are all real agents the universe is a massive computer\n",
      "we're discovering all of this knowledge and then we're bootstrapping that into\n",
      "um a machine learning algorithm and then the question is well if you kind of\n",
      "just capture the thing now without the Dynamics that produced it um will the\n",
      "system be robust and could you still um you know kind of carry on as we were in\n",
      "the real world if if that makes sense so um but yeah the interesting thing with\n",
      "the work you've done is is that you are modeling agential systems and you are\n",
      "modeling Dynamics but could that be used for you know much more complex tasks\n",
      "like the real world like simulating much more complex systems in the real world\n",
      "exactly yeah I think that if you if you so I think that just purely imitation\n",
      "learning alone is not really going to get you there um but I think that if you\n",
      "can if you can uh imitate so one is sort of finding the set of tasks I think\n",
      "that uh if you find the set of tasks or reward functions that could be relevant\n",
      "then you can start to simulate things that are otherwise really hard to capture\n",
      "by just purely imitating historical trajectories so for example strategic\n",
      "\n",
      "Chunk 28:\n",
      "adaptation type of behaviors are really hard because those are sort of an\n",
      "open-ended space of behaviors where if you basically have like a stock market\n",
      "for example that's a really good example where if you have a stock market that's\n",
      "a very open-ended system and like different Traders will have different\n",
      "strategies that are best responses to each other and then over time the set of\n",
      "strategies evolves over time in an open-ended way um you know trading strategies\n",
      "that worked 10 years ago probably won't work very well today because people have\n",
      "sort of um they they've sort of figured out uh those strategies and so they\n",
      "won't be very competitive and so um I don't see an IM an imitation learning\n",
      "system being able to sort of um generalize to that level of complexity just\n",
      "because by definition it's imitating previous uh trajectories and therefore\n",
      "strategies so I think you need some of like a um a more uh more interactive\n",
      "trial and error learning that allows for strategic adaptation and that requires\n",
      "some notion of a payoff or a reward and so you kind of need to have this this\n",
      "idea of um you you can't just purely I think learn uh a model of something like\n",
      "the stock market just based on previous data you really need to have more\n",
      "inductive biases around uh sort of you know what creates a payoff or what the\n",
      "actual reward function is for each of the Traders uh but that might be something\n",
      "that you could um you could learn over time but I but maybe not in the yeah so\n",
      "sry this is kind of like this not very coherent but I feel like uh you might\n",
      "need something that looks more like learning over a space of programs that\n",
      "starts to Encompass different kinds of uh tasks and then you can basically\n",
      "simulate those tasks to completion with agents that can essentially uh try to\n",
      "self-improve against other agents the stock market I think is a wonderful\n",
      "metaphor for what we're talking about and for for two reasons first of all from\n",
      "the grounding reason because you know like the the the mtic world is very\n",
      "ungrounded and that's why we develop as humans lots of weird shared delusions\n",
      "about things because it's actually like you know it can go in it can go in\n",
      "almost any direction and also the concept of alpha I think is really important\n",
      "because a trading strategy works really well today and then when other people\n",
      "learn about it it no longer provides an advantage because everyone else knows\n",
      "about it and I feel it's the same with language models so you know like GPT 4\n",
      "Pros was really novel and cool it it was great to you know have like a TED Talk\n",
      "speech when it came out and now it doesn't seem cool anymore because everyone's\n",
      "using it on LinkedIn so it's almost like that that we need to have this in like\n",
      "continuous creative evolving process producing new sources of Alpha and the\n",
      "Paradox is that if everyone has access to the same model it can't be a source of\n",
      "alpha by definition yeah I guess on that like topic because we kind of talked\n",
      "about like synthetic data earlier and you kind of said like you know one one\n",
      "mechanism towards getting like a kind of self-improving system that is able to\n",
      "kind of you know continue to improve is to kind of like fil so the synthetic\n",
      "data for example so we might kind of you know have the the new system and then\n",
      "we generate some more data and then we kind of have some like filtering\n",
      "mechanism to say that you know in the current stock market this is this is good\n",
      "data or what you know whatever system we're thinking about and and then we can\n",
      "kind of like use that to enable the model to improve um you know and adapt to\n",
      "the new system but something I've always like like thought about is like or I\n",
      "guess one is is it really trivial to be able to like filter that you know new\n",
      "synthetic data and then two it feels like if you're just relying on like fil\n",
      "filtering existing synthetic data like is isn't that inevita going to kind of\n",
      "plateau and so I guess event you know we talked about how you kind of said that\n",
      "you do actually actively not need to go out and get real more real data but I\n",
      "guess I'm kind of asking you do you think this idea of just like filtering\n",
      "synthetic data from a model is kind of sufficient to always be able to adapt and\n",
      "improve or is it always going to be a mixture of like more real data plus\n",
      "synthetic data filtering I think it's the letter just because um at some point\n",
      "you would expect that uh the synthetic data you do generate it'll start to sort\n",
      "of uh saturate like what's already in the model um just cuz the model is trained\n",
      "on a finite amount of information so at some point you're just going to start to\n",
      "\n",
      "Chunk 29:\n",
      "see more and more um especially like the more likely trajectories or sequences\n",
      "of samples you'll start to see that uh more and more and so you're not really\n",
      "going to be very sample efficient in terms of searching for the synthetic data\n",
      "so can can you tell us about the results of the paper totally Yeah so basically\n",
      "we evaluate this um this algorithm on a bunch of like synthetic simulated\n",
      "domains kind of like robotics related tasks um and kind of yeah environments\n",
      "where there's like varying levels of complexity so you know you might have a\n",
      "robot pushing around a variable number of like objects or maybe you have\n",
      "different terrain that the robot might want to um um learn to kind of you know\n",
      "do Locomotion over and things like this um and so kind of you know the main\n",
      "comparison we make is like how well does Waker work relative to like naive\n",
      "domain randomization so how well does it work if you just like uniformly sample\n",
      "the space of environments versus if you do actively seek out the environments\n",
      "that have this like higher uncertainty um and so basically what we show is that\n",
      "you know if we do the Waker approach we still do like very well on average but\n",
      "we consistently do better in terms of robustness and so robustness by robustness\n",
      "I mean here that that we do better in terms of the worst environments um that\n",
      "the agent is evaluated under and so this kind of means you know if the agent is\n",
      "able to do well in the worst environments that it that it is evaluated under\n",
      "that kind of shows that it's able to do well across all environments because its\n",
      "worst performance is still good um so you kind of this this shows that we we\n",
      "achieve this like robustness property um which we talked about in terms of like\n",
      "Mini Max regret um but we EV we don't evaluate it in terms of like the true\n",
      "notion of Minimax regret because as as we talked about earlier actually\n",
      "evaluating regret exactly is difficult because that that that requires knowing\n",
      "the exact true Optimal Performance um which isn't something we can really know\n",
      "so instead we we just show that you know the agent performs well across all\n",
      "environments more so than if you just like naively sampled the environments\n",
      "uniformly and in terms of decomposing the performance across the spectrum of\n",
      "possible environments so like you know the ideal situation is that we have a\n",
      "very simple model which just generalizes so we happen to have found the golden\n",
      "Motif you know there's a spectrum of correlations almost all of them are Furious\n",
      "but we've just you know just by through some sheer magic we found the best Motif\n",
      "to work in all situations probably that's not quite true probably there are some\n",
      "good generalizing motifs and the model has also kind of like memorize the long\n",
      "tail and that there's some degree of like you know it works really well on on\n",
      "the test set but might not out of domain distribution do you have any like way\n",
      "of reasoning about what that is um so yeah I agree I guess there's like not\n",
      "necessarily it's not necessarily the case that by like focusing more on these\n",
      "like longtail examples that's necessarily the best way of of training the best\n",
      "model because like you said like maybe it happens to be the case that if the\n",
      "model is trained on some certain subset of the task like that will actually\n",
      "generalize better but but I think in practice that's not something we can really\n",
      "um really know how to you know like optimally select the the best kind of set of\n",
      "tasks that will generalize well and so we do focus more on on like you know\n",
      "these these kind of longtail tasks like the ones that we might see rarly and\n",
      "therefore have high uncertainty about um in terms of like the the out of\n",
      "distribution generalization so so we do also do some experiments like looking at\n",
      "how well does the model generalize out of distribution um and basically what we\n",
      "show is that if we train the model in this way and then we give it some more\n",
      "environments that hasn't seen at test time um if the environments are more\n",
      "complex then then we've seen sorry hasn't seen it training time basically like\n",
      "this model then generalizes better to outof distribution environments that are\n",
      "like more complex which is kind of what you'd expect cuz we've kind of bias\n",
      "something towards more complexity we're able to generalize better to out of\n",
      "distribution environments that have higher complexity and then guess the\n",
      "question is like do we care about out of distribution environments that have\n",
      "higher complexity like what about the out of distribution environments that have\n",
      "lower complexity um and I would argue that you know basically the lower en out\n",
      "dist distribution environments that have lower complexity like we would already\n",
      "expect that the model is able to do very well at so so there's not really much\n",
      "of a difference there because you know almost any reasonably trained model can\n",
      "\n",
      "Chunk 30:\n",
      "handle the very simplest environment so what we really care about is can we\n",
      "generalize out of distribution to like higher complexity environments and so by\n",
      "biasing the suling towards the higher complexity environments we do show that\n",
      "we're able to generalize further out of distribution to even higher complexity\n",
      "environments okay but is there any way of knowing whether it's kind of like\n",
      "memorizing the high complexity instances or whether it's still learning abstract\n",
      "motifs and generalizing between them yeah that's a great question I think that's\n",
      "a really interesting question generally for ML as a field right now which is\n",
      "better evaluation benchmarks for uh generalization within different kinds of\n",
      "models um and like we we alluded to earlier there's kind of this uh issue of\n",
      "data leakage between training and test set which is um which is definitely an\n",
      "issue that is currently happening with large language models um it doesn't take\n",
      "away from the impressiveness of these models CU clearly there is a strong\n",
      "generalization aspect to their behavior but I do think that in terms of\n",
      "measuring performance on specific benchmarks um we really need to solve this\n",
      "problem how do we have these clean data sets uh that allow us to to truly test\n",
      "on inputs that the model hasn't seen at training um I think in the case of uh\n",
      "reinforcement learning um that's a bit more difficult just because usually we\n",
      "focus on a particular task domain and so there's always going to be some shared\n",
      "similarities within task but obviously uh we didn't do this in this paper but we\n",
      "could try things where we have more um more controlled uh settings where we you\n",
      "know change one aspect of the environment and really uh see if it's learning\n",
      "specific causal relationships between um things that have to be accomplished in\n",
      "that task uh but we didn't do that um that actually think would be a really\n",
      "interesting idea for uh a new evaluation environment for RL yeah I mean the\n",
      "benchmarks thing is just a huge challenge in in machine learning um in general\n",
      "but just just to kind of round off off the interview I mean M you were talking\n",
      "about you're doing some work with um Ed grinstead and he's an amazing guy I'm\n",
      "get getting Ed back on and um you said that um been looking into this kind of\n",
      "the interface between humans and and machine learning can you tell me about that\n",
      "yeah so just to not say too much about it because um it's related to current\n",
      "work that's happening at Deep Mind um is just that you know I think from\n",
      "personally from a high level point of view I'm very interested you know talking\n",
      "about this divide sort of this fork in the road in terms of what's the path to\n",
      "open studying open-endedness studying it in silico or studying it in situ in the\n",
      "setting of an actual open-ended system like a user um app interaction or you\n",
      "know the interaction between a user and a piece of software on the web uh or\n",
      "potentially with many other users there are such Rich um existing systems online\n",
      "that are already open-ended because they amplify or connect the creativity and\n",
      "knowledge of humans um to create more knowledge and more creative artifacts and\n",
      "so I think what's really uh interesting in my mind now is sort of studying uh\n",
      "systems or algorithms that allow us to better steer the creativity of humans uh\n",
      "as they are uh mediated by software um and basically allow us to essentially\n",
      "amplify existing intelligent or creative systems that are open-ended so amplify\n",
      "existing open open-endedness rather than try to build it from scratch amazing\n",
      "guys it's been an honor to have you on MLS thank you so much thanks so much yeah\n",
      "great cool yeah we're done\n",
      "\n"
     ]
    }
   ],
   "source": [
    "transcript = \"\"\"\n",
    "open-endedness is essentially you know we're studying systems that can generate\n",
    "their own data in an infinite uh capacity and so it's systems that essentially\n",
    "if you run it for longer and longer they get more and more complex they generate\n",
    "more and more quote unquote interestingness or interesting data um and so if we\n",
    "can actually you know crack this snut of how do we actually come up with a\n",
    "self-improving system in the sense that keeps generating interesting data uh we\n",
    "can then use that data to train further train our models but of course you get\n",
    "into this perpetual ual uh data machine type of uh idea where obviously you know\n",
    "there's how do you generate more data uh if you know the data is ultimately\n",
    "coming from a model that you probably trained on previous data how do you get\n",
    "net new information from that well I think a lot of this is actually just\n",
    "resolved purely again going back to this idea of the reward function right or a\n",
    "preference function where there is outside information coming in through some\n",
    "sort of filtering criteria for example human designers in the loop uh or\n",
    "designers designing some sort of preference model that could essentially automat\n",
    "Ally rate the kinds of automatic uh data that's being generated by these\n",
    "open-ended systems what does Waker stand for right so Waker stands for weighted\n",
    "acquisition of knowledge across environments for robustness fantastic and what\n",
    "was the title of the paper oh right yeah reward free curricular oh God what was\n",
    "the title of the paper reward free curricular for training robust World models\n",
    "that was it okay so um give us the elevator pitch yeah totally so basically like\n",
    "the overarching um question that we're trying to answer with this paper is like\n",
    "how should we go about training like very general agents so in the context of\n",
    "the paper we think of a general agent as being one that's able to perform a lot\n",
    "of different tasks so we might think of these as different reward functions if\n",
    "we're think thinking of it from a reinforcement learning perspective um but also\n",
    "be able to perform those tasks in lots of different environments so you know we\n",
    "don't want a robot to just be able to do you know pick up tasks or do tasks in\n",
    "my like my kitchen specifically we want the robot to be able to go into like\n",
    "arbitrary Apartments and also be able to do those tasks in like arbitrary\n",
    "environments and so we we kind of thought about like yeah how do we want to\n",
    "create an agent that can do such a thing and we argue in the paper that a good\n",
    "way of doing it would be to have an agent that um has a very general World model\n",
    "um so a world model meaning that it can predict the outcome of sequences of\n",
    "actions and predict what will happen if it does certain actions and so we argue\n",
    "if we have a very general World model that can lead to a very general agent\n",
    "that's able to um perform you know a variety of tasks in different environments\n",
    "and so then you know once we've established that we kind of asked the question\n",
    "of how do we get a very general World model and what does it mean to have a good\n",
    "World model that works um well in a very general setting across different\n",
    "environments and different tasks like how do we Define that and how should we\n",
    "gather data to do that beautiful so I really enjoyed reading the paper and um it\n",
    "reminded me a lot of um Kenneth Stanley's poet paper so he was uh doing this\n",
    "thing called curriculum learning and it's it's really related to machine\n",
    "teaching as well there's quite a few things in in M machine learning where you\n",
    "say well if we had a really principled way of of selecting the best training\n",
    "data and presenting it to the Learner in the best possible order could the\n",
    "learner be better and in that poet paper Stanley was kind of generating a\n",
    "diverse set of environments and like training um a learner on those things and\n",
    "you're doing something very similar and you're using this Mini Max regret which\n",
    "is a concept from decision Theory can you bring that in yeah absolutely so um so\n",
    "I guess we have this notion of like wanting to be um to perform well across a\n",
    "wide range of scenarios right so scenarios in our context mean like different\n",
    "environments and different tasks and kind of like the most standard way of\n",
    "thinking about that especially in reinforcement learning or in machine learning\n",
    "in general is you think about like the average performance so so how do I\n",
    "optimize like the expected reward across all of these different scenarios um and\n",
    "a lot of the work that that M's done as well kind of argues that just just\n",
    "optimizing for expectation isn't necessarily the best um the best objective so\n",
    "you know we can imagine in the real world we don't really know like the\n",
    "distribution over possible tasks or or anything well you know in most situ we\n",
    "don't know things like that and so maybe a better objective is to try and be\n",
    "robust instead and robust basically we can think of that as meaning like we\n",
    "should do reasonably well in every situation we could be in and that that's kind\n",
    "of what a robust objective is um and one of the ways that you can define a\n",
    "robust objective is via Minx regret and so regret means like suboptimality like\n",
    "how well did I do relative to the best I could have possibly done so it means\n",
    "basically the same thing as it does in normal English um and so the Minimax\n",
    "regret objective basically says across all possible situations I want to try and\n",
    "do um minimize the regret across all possible situations minimize the maximum\n",
    "regret I should say so that means in all possible situations we should do almost\n",
    "as well as the best we could have possibly done um and I guess just to contrast\n",
    "this against the standard objective for robustness so the more common objective\n",
    "for robustness at least traditionally is like maximan performance that means\n",
    "maximize the performance while the environment's like minimizing and choosing\n",
    "the most adversarial environment or the most adversarial scenario um but but the\n",
    "problem with kind of the max objective is that in some environments you just\n",
    "can't do anything let's say like some s situations is too hard you're doomed and\n",
    "so if in some situations you're doomed and you always get like zero reward or\n",
    "negative Infinity reward that means there's no incentive to try and do better in\n",
    "any other environment because your maximan reward is always going to be zero and\n",
    "so therefore I think like minchi argues as well as Michael Dennis and a lot of\n",
    "these recent papers argue that Minimax regret so minimizing the maximum sub\n",
    "optimality is actually like a better objective for a general agent that's robust\n",
    "fascinating so um if I understand correctly is is it a way of saying I want to\n",
    "have the best case worst um expected regret uh yes so basically Minimax regret\n",
    "is saying that uh if you assume that you know the environment is adversarial to\n",
    "you in in some way like when you're training or at inference time when you're\n",
    "actually testing your policy out in the real world um Minx regret is saying the\n",
    "agent should behave the model should behave in a way that minimizes its worst\n",
    "case possible regret uh over all the possible conditions of the world uh that\n",
    "that this adversary could choose what's really interesting about this paper is\n",
    "we are talking about the reward free exploration phase and we're also talking\n",
    "about the domain of modelbased reinforcement learning as opposed to um you know\n",
    "let's say value based reinforcement learning where um you get this entanglement\n",
    "right so the the Dynamics the model of the world it's still in there but it's\n",
    "kind of in meshed with this with this value model whereas in model-based\n",
    "reinforcement learning in a principal way we kind of separate out the parts so\n",
    "that we can do explicit planning and a and simulations and stuff like that so\n",
    "we're very much in this model based domain right yeah absolutely yes we focused\n",
    "on yeah model based reinforcement learning or some people like to call this like\n",
    "the world model setting more recently but yeah like you said we you know in\n",
    "typical like model fre reinforcement learning we we we typically aim to learn a\n",
    "policy in a value function and yeah as you said like that value function is kind\n",
    "of implicitly encoding the Dynamics through the fact that we learn the value\n",
    "function using the Balman equation so so the bman equation kind of propagates\n",
    "the information between like transition and the environments through the value\n",
    "function so so the value function will like implicitly have the Dynamics in it\n",
    "um but in modelbased reinforcement learning we want to very explicitly model the\n",
    "Dynamics of the environment and so what I mean by that is we want to be able to\n",
    "take some previous sequence of observations perhaps those are images and then\n",
    "also condition on the next action we want to take in the environment and then be\n",
    "able to predict the distribution over the next observation or state so we're\n",
    "very explicitly modeling the Dynamics of the environment okay now this is really\n",
    "interesting because you know people think about reinforcement learning and in\n",
    "reinforcement learning you don't so much care about having a model of the world\n",
    "you care about building trajectories that lead to some you know task or goal or\n",
    "whatever that you're interested in so like I mean just just in broader terms\n",
    "what what what do we get from explicitly modeling the world so there there are a\n",
    "few Arguments for why we would want to explicitly model the environment so so\n",
    "one of which is um a lot of people would argue that you get better sample\n",
    "efficiency by modeling the environment and the argument for this is you know the\n",
    "reward function might be quite sparse and so if you're just relying on like the\n",
    "propagation of rewards backwards to try and learn the optimal behavior that\n",
    "might not be as efficient as actually learning the Dynam DS because the Dynamics\n",
    "can be learned from every single transition that you have it's kind of like a\n",
    "standard supervisor run supervised learning problem so so you kind of have like\n",
    "a richer signal to learn from which might arguably lead to better sample\n",
    "efficiency um but I think like the more concrete arguments that I would argue\n",
    "for are that if you have a model of the environment it's it's some kind of more\n",
    "General thing that you can then use to develop better decision- making later on\n",
    "so so if you just learn a value function you're kind of only learning how to\n",
    "optimally do that specific reward function or optimize that specific reward\n",
    "function um but if we have a model of the environment we can kind of arbitrarily\n",
    "be given some task later down whether it be a reward function or goal state or\n",
    "something like that and we can then plan to optimize that task later down the\n",
    "road so I would think that um you know it's kind of a much more General way of\n",
    "having a powerful decision-making agent rather than just specifying like one\n",
    "task and learning the optimal kind of policy for one task and I guess another\n",
    "thing that I'll add to that is um rather than only learning like a feed forward\n",
    "policy like you would and reinforcement learning so something that Maps directly\n",
    "to actions the other thing that a world model allows you to do is also to do\n",
    "online planning so you can imagine at test time we're trying to deploy in the\n",
    "environment but we can actually do a bit more further planning through the world\n",
    "model to then work out what the best action is rather than relying on just a\n",
    "newal network to immediately output an action and there's kind of a lot of work\n",
    "showing that if you can do this like planning at test time you can kind of get a\n",
    "lot of a better performance on a lot of environments especially things that that\n",
    "really rely on um search to do well things like go and like these kind of games\n",
    "where you do have to think explicitly ahead in the environment and so I would\n",
    "think the main reasons you would want to consider um learning a world model and\n",
    "maybe a last point I'll just add is that I think this is kind of a um again like\n",
    "unclear whether this is true necessarily but but I think some people would argue\n",
    "that a world model will generalize better than learning a value function so you\n",
    "can imagine like a world model is learning things like you know State\n",
    "transitions so you can imagine if you if you're training on state transitions\n",
    "the model is kind of implicitly being forced to learn something like physics or\n",
    "something like that and so if you're like very explicitly forcing the model to\n",
    "learn something like physics You could argue you know we'll go to some new state\n",
    "and the rules of physics will still hold and therefore the world model will\n",
    "still be quite good at the new state potentially whereas if you learn a value\n",
    "function I guess it's a little bit less clear as to whether you're put on a new\n",
    "situation will the same kind of structure of that value hold as it would a model\n",
    "anyway sorry that was a bit of a long answer but no no it's fascinating I mean\n",
    "when I was reading the paper the one of the reads I got is um in machine\n",
    "learning we are often overcoming the curse of sparcity so of course like in\n",
    "trajectories and reinforcement learning that that's quite intuitive but even in\n",
    "learning the world model itself the model um just because of the way they're\n",
    "trained it it tends to compress the world into small little motifs and actually\n",
    "the world is quite complicated and we need to combine the motifs together in\n",
    "lots of interesting and and Rich ways and by exploring through the world model\n",
    "we're almost kind of like making we're forcing it to make those connections yeah\n",
    "and I think um you know to follow up on Mark's um uh Mark's point I think it's\n",
    "also interesting because especially in the Waker paper uh the world model\n",
    "setting we're looking at specifically reward free world models and so\n",
    "essentially there's this uh EXP decision to separate separate out the two\n",
    "components of a world model which is essentially the Dynamics function which\n",
    "tells you how things transition from state to state how does a state transition\n",
    "state of the world transition to the next state of the world given an action\n",
    "that the model or the agent is taking in that world and the reward that it\n",
    "receives so this latter part the reward is defined by the reward function and so\n",
    "uh you know I think Mark was uh to follow up on his point a lot of the benefits\n",
    "of the world model is in this design Arrangement is that you can compositionally\n",
    "separate out this Dynamics aspect from the reward aspect so the general idea\n",
    "would be why should an agent trained in such a world model be able to generalize\n",
    "to a new setting well maybe if that setting shares a lot of the underlying\n",
    "Dynamics in that version of the world for example rules of physics and the agent\n",
    "has learned how to exploit those to accomplish um navigation around that\n",
    "environment or reach different types of tasks uh achieve different kinds of\n",
    "tasks in that environment then you can um sort of superimpose a different reward\n",
    "function that essentially defines a different task because the reward function\n",
    "defines what task success is so you can essentially superimpose different tasks\n",
    "on top of that Dynamics model and you would you know you could expect that the\n",
    "agent could learn more quickly because it's already mastered sort of the\n",
    "foundational skills of navigating or manipulating different aspects of the\n",
    "Dynamics of that world we've been on a bit of a journey here um I think over the\n",
    "last few years in in the literature of um we we want to have robust models and\n",
    "we're doing that by kind of perur and you know making a bunch of manipulations\n",
    "to the environment and there there was this domain randomization and there's\n",
    "like unsupervised environment design and of course your your iteration now is\n",
    "doing this in in in the domain of um reward free exploration but can you take us\n",
    "on on that Journey sort of maybe starting with um domain randomization kind of\n",
    "just to uh elaborate on something that Mark was previously talking about which\n",
    "is that the typical you know standard setup in machine learning is to uh\n",
    "essentially optimize a model's performance uh over a uniform distribution uh\n",
    "over the data points and so this is really just randomly sampling data points\n",
    "and we try to minimize the loss over those data points for whatever objective\n",
    "we're trying to minimize or maximize in reinforcement learning um we want to\n",
    "train agents that can perform well in lots of different uh versions of the\n",
    "environment and so um you can think of each environment uh almost as a bundle of\n",
    "data points right it's kind of the set of trajectories that the agent can um can\n",
    "encounter within that version of the world and we essentially in reinforcement\n",
    "learning we want to learn to maximize uh the reward of the agent uh in that set\n",
    "of uh trajectories so we want to specifically start to uh actively pursue those\n",
    "trajectories that give us the highest reward and we learn from the reward signal\n",
    "as the feedback signal for figuring out you know which actions uh and therefore\n",
    "which trajectories will lead to maximizing that reward and so typically um when\n",
    "we operate in the multitask setting uh we essentially randomly sample different\n",
    "versions of the environment and essentially have the agent try to maximize its\n",
    "performance its reward on that random sample of environments uniformly uh\n",
    "sampled from you know the set of possible environments um and this is\n",
    "essentially uh causing the agent it'll cause the agent to learn a policy that's\n",
    "optimal for essentially uniform distribution over those environments um but of\n",
    "course this is kind of a naive assumption because we essentially are assuming\n",
    "that every possible version of the environment is equally likely which is\n",
    "obviously not true because some versions of the world will not be as likely as\n",
    "others uh for example like if you walk outside the sky is usually blue and not\n",
    "green and so you know when the sky is orange maybe that happens if you're in\n",
    "California and there's a wildfire but that's not usually the case and so instead\n",
    "what we can do is we can turn to decision Theory and think of um sort of more\n",
    "sensible approaches to what it means to act optimally uh when you're uncertain\n",
    "about uh what state of the world uh the world will be in and so the thing that\n",
    "we focus on in this paper um is this idea of Mini Max regret where it is this\n",
    "idea again of having the agent act in a way that essentially minimizes its worst\n",
    "case regret um in any possible uh state of the world so largely you know this is\n",
    "a shift from randomly sample what it means in practice is you want to shift from\n",
    "randomly sampling environments during training to essentially uh sampling\n",
    "environments that maximize the agent regret and what this means is you're now\n",
    "actively sampling for those environment settings where the agents um\n",
    "experiencing the most regret and here regret is defined just simply as what does\n",
    "the optimal agent do in that version of the environment and what did this\n",
    "current agent that's learning do in that environment and so there's this Gap in\n",
    "performance and you want to actively find those environments where that Gap is\n",
    "maximal and if you view this as this adversarial game now between you know uh an\n",
    "adversary like nature that's choosing the environment and the agent that's\n",
    "learning to solve the environment um you can think of the adversary as you know\n",
    "having a a payoff function in that game or it's rewarded for the based on the\n",
    "regret that the agent experiences and the agent is trying to shrink that regret\n",
    "so the agent you can think of as being rewarded for you know um the the negative\n",
    "of that reward so the agents reward signal is you can think of as the negative\n",
    "of the regret and so now you have the setting where you can essentially view\n",
    "this training process this active sampling process as a two player zero some\n",
    "game where the adversary is you know rewarded for the regret of the agent in\n",
    "each environment it chooses and the agent is rewarded based on the um the agent\n",
    "receives the negative regret as its uh payoff and so um we know that in two\n",
    "player Zer some games there's always a uh this there's always a solution called\n",
    "a Nash equilibrium and so this is an idea in Game Theory where basically this is\n",
    "um a choice of behaviors on both parties or a choice of strategies on both\n",
    "parties in the game such that um no play can do better unless the other player\n",
    "changes their strategy and so you can think of this as a situation where you\n",
    "know I'm not neither player is incentivized to deviate from their behavior uh\n",
    "once they reach this choice of mutual strategies and so we know that all two\n",
    "players zero some games have a Nash equilibrium uh set of strategies between the\n",
    "two players and in this case uh we know there's additional theorem called the\n",
    "Mini Max theorem which says that when in a two player Zero Sum game specifically\n",
    "two players and zero sum when um you are at the snash equilibrium setting then\n",
    "each player must be playing what's called U the Mini Max um the Mini Max\n",
    "strategy which means that each player is minimizing the maximum um minimizing\n",
    "the maximum reward for the other player and so here the reward again is the\n",
    "regret and therefore just based on this known you know theorem about two player\n",
    "zero some games we know that um the agent which is you know receiving the payoff\n",
    "of negative regret it's the Min player it must be implementing the Minimax\n",
    "regret strategy and so this is how we essentially can shape the training process\n",
    "to essentially um arrive at an agent that performs Mini Max regret\n",
    "decision-making rather than decision-making that optimizes um just a uniform\n",
    "sample of environments okay so can I play back um some of those things as I\n",
    "understand it so um essentially we we are we're building a model which will\n",
    "learn to select the environments where we perform badly on and then we fine on\n",
    "those environments because we're leaning into the gaps we're saying where where\n",
    "do I perform badly let's fine tune on that and then you're saying that if we\n",
    "continue to do this as a kind of adversarial sampling game that we will reach a\n",
    "Nash equilibrium so it will converge in a good place but help me understand that\n",
    "why would it you know it seems to me intuitively that it might be unstable or it\n",
    "might not quite what why does it converge so there's no guarantees around\n",
    "convergence and and so I think this is an area where there's a lot of room for\n",
    "innovation around these methods a lot of this is um this is more I would say\n",
    "like theoretical motivation around why we think actively sampling environment\n",
    "settings based on um estimates of regret is a good idea and another Point\n",
    "related to that around sort of this gap between the theory I I just um explained\n",
    "and in practice is that uh regret itself is a pretty hard quantity to actually\n",
    "uh measure in practice because you know knowing regrets defined as what's\n",
    "Optimal Performance um minus my agent performance so you kind of have to know\n",
    "what Optimal Performance is and in general you don't know the optimal Behavior\n",
    "therefore you don't really know the Optimal Performance on any environment\n",
    "unless it's like a very toy setting and so um in practice we also use\n",
    "approximations for the regret uh in order to do this kind of active sampling and\n",
    "so um there's a lot of deviations between theory and practice um so there's no\n",
    "guarantees you know that different forms of gradient-based optimization uh for\n",
    "RL training would actually lead to converging to Nash equilibria uh a lot of the\n",
    "theory is just stating that if you were to run the system the learning system\n",
    "for a long time if we make the assumption that the optimization algorithm is uh\n",
    "fairly good at producing you know an improved response to the other player in\n",
    "this type of Zero Sum game you if you're assuming that if the successive sort of\n",
    "series of best responses uh that the optimization algorithm is generating um\n",
    "continues to improve over the previous ones you could make the assumption that\n",
    "maybe eventually it does get to that equilibrium but um there is no mathematical\n",
    "guantee that this actually happens what we want to do is um uh you know build\n",
    "this latent uh Dynamics um uh you know predictive model which is a simulacrum of\n",
    "of what the idealized version is but we don't have a way of directly Computing\n",
    "the regret so we kind of perform um you know we learn a proxy for that regret\n",
    "how does that work so we think of regret in in the following way so so there's\n",
    "kind of this um old school result from like um mdp Theory or maybe it's not that\n",
    "old but like 20 years ago or something like that called the simulation lur and\n",
    "that basically says that you know if if we let's assume for now that we we have\n",
    "like an optimal planner so we can give our like model of the world to this\n",
    "optimal planner and and some reward function let's say later down the road we\n",
    "get given some reward function and so we give the model and the reward function\n",
    "to our optimal planner and we assume that this planner can return the optimal\n",
    "policy in our model um so we kind of have this you know planning Oracle um and\n",
    "if we assume that we can do that then we can think about the difference between\n",
    "like how good the policy would be from um a planning Oracle in the model versus\n",
    "the truly optimal policy in the real world and so what the simulation L tells us\n",
    "is that you know the difference between these two policies so the one found by\n",
    "acting optimally in the model versus the truly optimal one is bounded\n",
    "essentially by the error between the model and the real world under the\n",
    "distribution of states that the policy uh would generate so so you know it only\n",
    "it only matters that we have low eror where the policy would go essentially\n",
    "because you know if there are some states that are just completely irrelevant\n",
    "what the policy is going to do it's not really going to matter if the if the\n",
    "models not accurate there um so we kind of use this result to think about the\n",
    "regret so that that gives us like you know if we have like one um one true mdp\n",
    "and one model of an mdp and one reward function the simulation limit can tell us\n",
    "you know what would kind of be the regret if we did this optimal planning um\n",
    "within this one model of the um of the mdp um but then in our work we're we're\n",
    "not really interested in the sitting of like one mdp one reward function um so\n",
    "we start to think about you know what happens if we have arbitrarily many\n",
    "environments as well as arbitrarily many reward functions which we don't know in\n",
    "advance and then I guess the other thing that I should say like you you alluded\n",
    "to like latent Dynamics is you know these existing results are assuming that we\n",
    "have an mdp that's fully observable meaning you know exactly what the state of\n",
    "the environment is um but usually when we think about like World models or even\n",
    "or just maybe more modern reinforcement learning we're really interested in\n",
    "learning from like quite high-dimensional signal so images or maybe well\n",
    "probably images but maybe there are other high high dimensional signals we we\n",
    "want to reason about um and because we're just using image observations this\n",
    "means like the world is like partially observable like we can't infer everything\n",
    "we need to know about the world just from one image you know for for basically\n",
    "any physical task like the velocity of objects is important but you can't infer\n",
    "that just from one image um so in this partially observable environment we\n",
    "really want to take um a sequence of observations because we need to to use\n",
    "those sequence of observations to infer what the state is so you know viewing a\n",
    "sequence of images will help me to infer what the um the velocities are for\n",
    "example and so we can think of this as inferring like a belief a belief over\n",
    "what the state is in a partially observable nvp um so we need this full sequence\n",
    "of images and we need to use the full sequence images to then to be able to\n",
    "predict ahead what the next observation will be um that's kind of what you know\n",
    "most World models are attempting to do um but if we just like Tak in a bunch of\n",
    "images and then try and directly predict images again that's like quite a hard\n",
    "problem um to just like just predict straight an image space and so the most\n",
    "common thing to do is kind of to take your previous sequence of images and then\n",
    "try and get like some compressed representation of the history of images into\n",
    "like the latent State um and then predict the Dynamics in the latent state so\n",
    "yeah so I have my sequence of images I kind of compress these somehow into some\n",
    "vector and then I give it a new new action and I try and predict what the next\n",
    "kind of latent Vector will be given this new action and this now represents my\n",
    "prediction of the the Dynamics in the world and then if I want to um you know\n",
    "predict what the next observation would be in image space then I can also decode\n",
    "that back to an image um but then a lot of works also argue that maybe we don't\n",
    "want to actually learn to predict the entire image so maybe you don't want to\n",
    "actually decode the entire image but that's that's another aspect that we might\n",
    "want to get into but there's this whole broad story of of um working in the\n",
    "latent space and um in reinforcement learning there was that paper called World\n",
    "models by you know David Haron and Schmid Huber and it it also I think has a\n",
    "relationship with you know what laon's doing with jeer and these like you know\n",
    "joint embedding prediction architectures so there seems to be something magical\n",
    "about working in in the latent space and also you were talking about um you know\n",
    "partially observable Markov decision processes and you know that seems to be\n",
    "this idea that we need to have a modeling framework for the world and I I guess\n",
    "like the ideal situation would be is that like we just we we knew exactly what\n",
    "would happen you know every single time step in every single state um but we\n",
    "don't you know so so so we model it as a partially observable Markov decision\n",
    "process and the Markov bit is quite interesting as well I mean maybe um you guys\n",
    "can just introduce why do we use that as a model so markovian basically just\n",
    "means you only need to look at like the current state to be able to infer all\n",
    "the information about the system um so so in a Markov decision process we have\n",
    "some State and then we assume that we're able to take some actions and given\n",
    "some State and some action we get some distribution over next states of the\n",
    "system and then the the system will transition according to that distribution to\n",
    "the next state and this is just like kind of a general framework for modeling\n",
    "like systems that we might want to control so you know it kind of dates back to\n",
    "like early work and control theory but then it's also the main framework used in\n",
    "reinforcement learning um yeah in the reinforcement learning setting because\n",
    "it's the decision process we we also add in a reward function which tells us how\n",
    "good it is to be in a certain state or to execute a certain State action pair um\n",
    "but yeah as you said with relating to like partial observability in a lot of\n",
    "like systems we we don't actually know what the true like state of the world is\n",
    "so so you can imagine you know if we want to think of the entire world as a\n",
    "partially observable mdp we can't just have some Vector telling us exactly what\n",
    "the true configuration of the world is or or maybe that exists but we can't we\n",
    "definitely can't just know that and so we usually think of it as being a\n",
    "partially observable system um so this means that like given given the state um\n",
    "you know at each step we'll basically get some distribution over observations\n",
    "and we just get to observe that observ so you know the state of the world could\n",
    "be what it currently is in here and maybe my observation is like a camera image\n",
    "so I only get some camera image of the world that allows me to infer a bit of\n",
    "information about the state um and because it only allows me to infer a bit of\n",
    "information about the state it doesn't tell me the whole state it really you\n",
    "need to keep track of all the observations you have to be able to keep track of\n",
    "all the information you have about the world so you know you could imagine um if\n",
    "the task is for me to remember how to get out the door a while ago um you know I\n",
    "I I don't just need need to be able to like look at my current image of the\n",
    "world to be able to infer that information I need to have kept track of like all\n",
    "my previous information as well um so that's kind of why we think about often\n",
    "want to think about like partially observable environments as opposed to fully\n",
    "observable ones amazing amazing so so so mention maybe you can um uh bring in\n",
    "this this latent idea sure and and sort of contrast that to what Lon is doing as\n",
    "well sure I mean so I think in machine learning and deep learning uh there's\n",
    "this General Paradigm that's been around you know since the Inception which is\n",
    "learning uh late latent representations of data and one of the benefits of\n",
    "learning latent representation is that um you know ideally your objective uh\n",
    "that leads to learning these latent representations is that you are ultimately\n",
    "learning a lower dimensional representation of the data or dynamics that you're\n",
    "modeling like in our case with the world model um that captures just what is\n",
    "necessary it's a more compact representation of just the information that's\n",
    "necessary to predict the task you're trying to predict and so um with uh with\n",
    "our case uh or Laten space World models a lot of the benefit of working in the\n",
    "latent space is that if uh as opposed to working in the full image space for\n",
    "example if your observations are images like in a video game is that there could\n",
    "be a lot of sporus features or you know a lot of additional information that you\n",
    "could be expending lots of compute and um you know gradient updates just to\n",
    "learn those patterns when they don't actually impact the ultimate um transition\n",
    "Dynamics or reward dynamics that you need to learn in order to do well in that\n",
    "environment so one example is if you have a game where you know maybe the\n",
    "background is different uh because it's daytime or nighttime or it's close to\n",
    "Sunset um but ultimately you know the background doesn't really impact uh how\n",
    "the player moves around in the environment or whether they've reached the end\n",
    "goal of the task and so if you're training a uh model where it needs to compress\n",
    "a lot of this information first into a smaller dimensional latent Vector latent\n",
    "representation um you don't really need you would expect that latent\n",
    "representation not to actually capture it would start to ignore the background\n",
    "color and it might only capture certain features of the environment that can um\n",
    "essentially if you were to decode it back out it might only capture certain\n",
    "information about the environment that's predictive of the actual task that you\n",
    "want to solve um so maybe if the task is to say reach a coin at the end of a\n",
    "level then maybe the lant representation would capture the presence of the coin\n",
    "or whether the the proximity of the character you're controlling to the coin um\n",
    "and so uh with the JEA related work I think a lot of this is also you know\n",
    "motivated with with this idea where if we can learn a better latent space\n",
    "representation um of images or videos or whatever modality we're trying to model\n",
    "um it's a much lower dimensional computationally efficient representation uh\n",
    "that you can um you can effectively use for Downstream tasks um I'm not s I'm\n",
    "actually not super familiar with exactly you know the the visual JEA uh uh\n",
    "objective so I don't think I can say too much about that oh that's okay yeah I\n",
    "mean but but yeah I mean you pretty much nailed it so um I mean Lon even gives\n",
    "the example of like um you know in self-driving cars you might not be interested\n",
    "in the leaves on on on the road you know so like with increasing levels of of\n",
    "nesting you kind of like learn to ignore the things that are not relevant and\n",
    "focus on the things that that are relevant but we we're almost getting to the\n",
    "center of the bullseye here so in intelligence to me is all about model building\n",
    "and and and that's what these abstractions are they're models that kind of are\n",
    "predictive about the thing that that that's relevant and kind of like ignoring\n",
    "what is not relevant and we build better models when we have a curriculum\n",
    "apparently this happens in nature as well Max Bennett I was talking to him the\n",
    "other day and he said you know our genome doesn't encode all of our skills um\n",
    "explicitly because it would be too inefficient to do so but they do encode a\n",
    "kind of curriculum so we teach babies you know we Babble with babies and we\n",
    "teach babies how to talk and stuff like that so so the curricula is is really\n",
    "important and then we we're getting to the center of the bullseye which is\n",
    "intelligence in in general now I think Lon thinks that it's specialized and and\n",
    "what that means is that there there motifs that statistically generalize and\n",
    "what that means is that you do need environments you need to find motifs that\n",
    "are present in in as many environments as possible and those are the\n",
    "generalizing features do would you agree with that yeah definitely I think that\n",
    "a lot of um so a lot of really powerful machine learning methods for example uh\n",
    "are trained in simulation and when you're training in simulation there's a\n",
    "concept in control from control literature uh called the sim2 real Gap and\n",
    "essentially this is essentially quantif in a performance difference between uh\n",
    "well it's quantifying a few things one is just how different is the are the\n",
    "actual physical or other other kinds of Dynamics captured by your simulator\n",
    "compared to reality so if you have a physics simulator how accurate are for\n",
    "example the friction Dynamics or different kinds of contact Dynamics uh in your\n",
    "robotic simulator compared to those actual Dynamics in the real world with a\n",
    "real robot um and this also leads to a Sim tooreal Gap in terms of performance\n",
    "so if you train in the simulator you know a lot of times what machine learning\n",
    "is really good at is is really good at learning to exploit whatever system\n",
    "you're training the uh the model in and so it's fairly um common for you know\n",
    "systems that or models that are trained within a simulator to learn to\n",
    "eventually exploit the simulator and so actually like one big area of um games\n",
    "AI is using is actually leveraging this idea where they essentially use ml\n",
    "models they optimize ml models to within a certain game environment to try to\n",
    "find bugs within that environment to look for exploits automatically um so ml\n",
    "system is very good at finding exploits and whatever system you have but then\n",
    "the issue is those exploits are usually where exactly where the gap between your\n",
    "simulator and re reality resides and so you actually don't want your model to\n",
    "learn to exploit these differences between the simulator and reality to get a\n",
    "high performance uh because that kind of defeats the purpose of then later\n",
    "transferring your model that's trained in simulation to reality because now in\n",
    "reality obviously the model can't exploit those same those same glitches within\n",
    "the simulator um yeah so yeah yeah I mean because the reason this is really\n",
    "interesting is is that the the premise of your paper is that it is possible to\n",
    "build a generalist agent which means it's an agent that can be fine-tuned and\n",
    "worked really well on a on a whole bunch of Downstream tasks and to me that\n",
    "implies that at least in our physical world in any situation you might use this\n",
    "agent that there are General motifs that it could have learned during\n",
    "pre-training that it could like you know become activated in any situation um\n",
    "does that is is that fair yeah May I can say something about um just the way\n",
    "that we should could think about like the different like latent Dynamics\n",
    "objectives so so I think I agree that like at least when I try and think about\n",
    "how I think or how people think I think I agree that like you know a truly\n",
    "intelligent system should kind of think through the world and like a very\n",
    "compressed representation of the world like if I'm trying to like think through\n",
    "how to go to the airport like I'm definitely not like predicting ahead in terms\n",
    "of like the raw image space of trying to predict every image I might observe on\n",
    "the way to the airport and things like this and so I think we have this kind of\n",
    "like trade-off between you know um like you said with the VF paper like should\n",
    "should we just try and like um kind of basically model like the minimum\n",
    "information we need about the world to try and you know do the do the relevant\n",
    "task in the world and I think what you're saying I think that probably is maybe\n",
    "more what we think about when we think about like human intelligence or\n",
    "something like that um but then there's also this other way where we just say\n",
    "we're going to just like train enforce the model to be able to predict ahead\n",
    "every single image and so in our paper we do actually enforce that the model has\n",
    "to predict the next image um and so um basically what this might mean is yeah\n",
    "maybe the model does you know hopefully it does like like you said like kind of\n",
    "capture the underlying like true things that matter in in in the environment but\n",
    "it might also mean like what we were saying with like the leaves example like\n",
    "this might Force the model to kind of capture a lot of irrelevant details that\n",
    "don't really matter like the leaves on the ground and things like this and so\n",
    "you know maybe that means it isn't actually capturing the underlying motifs it's\n",
    "actually just getting good at image Generation Um but then or or image\n",
    "prediction I should say um but then I've also heard arguments kind of saying you\n",
    "know so what if people don't really think in terms of like image prediction you\n",
    "know I you know we think in terms of like like these high level motives but\n",
    "people have other people would argue that you know kind of the machine learning\n",
    "Machinery is there to do really good image prediction so so if if we if we can\n",
    "get a model that can actually just like predict images ahead really well um and\n",
    "not really worry so much about whether it's reasoning about these like high\n",
    "level features you know if you can predict images ahead really well you know\n",
    "that's enough to make to do good decision- making a lot of context so I think\n",
    "there's this kind of like contrasting ways of thinking about you know image\n",
    "prediction is good enough we'll just predict like really visually good scenes\n",
    "and that will be good enough for decision- making or do we want to the model to\n",
    "try and reason about like more abstract features of the environment and that's\n",
    "kind of a more intelligent way of reasoning about the world um and yeah I think\n",
    "that's a very interesting tradeoff um yeah yeah I mean like it's um like the\n",
    "biggest problem in machine learning is overfitting you know so as you say like\n",
    "that there all of these statistically generalizing features but they generalize\n",
    "within a domain and the domain might be like your your simulator or like you\n",
    "know how you're training it rather than how it's being used in in production and\n",
    "then as you say that there's also this Almost Human chauvinistic or puritanical\n",
    "view on this which is that well um you know it does the right thing for for the\n",
    "wrong reasons or or I I use different motives to do the reasoning so that thing\n",
    "must be doing it wrong you know what I mean and um I was talking with Chris\n",
    "Bishop at MSR the other day and and you know he's um big on symmetries and you\n",
    "know the kind of stuff that like Max Welling and Tak Cohen and bronstein and um\n",
    "deep M have done loads of cool stuff on on this but it's this idea that like we\n",
    "know the world um has a certain geometry it has certain physical prior so like\n",
    "we can deliberately um you know kind of construct the approximation class in\n",
    "machine learning method so so that like we it an easier problem right because\n",
    "because we know we know the thing is in there yeah so I mean I guess sort of the\n",
    "uh slight tangent I went into around the simt Gap I guess part of the point I\n",
    "wanted to make there is that um you know one way around the Sim to Gap is you\n",
    "could try to train um you could try to parameterize a very large space of\n",
    "possible versions of reality and this is kind of the motivation behind this\n",
    "method of domain randomization where you sort of say this is the you know this\n",
    "is the specific task domain I care about I can parameterize the different uh\n",
    "vers of the task with a few parameters and I basically want to search over the\n",
    "space of parameters and train my model or my agent on all possible variations of\n",
    "this world but obviously that's not very sample efficient because that design\n",
    "space could be huge could be massive and so instead we like these active\n",
    "sampling strategies like we were talking about earlier uh around Mini Max regret\n",
    "style um active sampling where you sample those environments that maximize your\n",
    "regret or some other type of objective maybe like uncertainty uh similar to what\n",
    "we do in the Waker paper um but ultimately these things this active sampling\n",
    "process it leads to uh what we like to call an auto curricul automatic\n",
    "curriculum um and this is in contrast to Prior curriculum learning works because\n",
    "here this is um an automatically generated curriculum so you you can kind of not\n",
    "have any predefined notion of what is easy or hard it's purely fixed to what is\n",
    "easy or hard for the model in terms of how good the model is at performing at\n",
    "those tasks and so it's nice it's an automatic curriculum so you can think of it\n",
    "as almost like weaving a path through this high-dimensional design space\n",
    "automatically such that if the uh agent or model were to train on data along\n",
    "this path of environments through its experiences in this path of environments\n",
    "during the training curriculum it'll basically be maximizing some sort of\n",
    "Information Gain objective um because you know for example regret if there's a\n",
    "high regret that's that means there's a high uh ceiling there's a high Gap in\n",
    "terms of how much the agent can improve which implies that there's a lot more\n",
    "for the agent to learn in those environments so it's sort of this like Optimal\n",
    "you want to find this optimal path weaving through this High dimensional design\n",
    "space of environments now the danger here is that as you do this uh Auto\n",
    "curriculum the auto curriculum uh could also get go Haywire very easily because\n",
    "the design space is so big if you're training and simulation which we have to do\n",
    "because these methods are so sample inefficient we need so much data to train\n",
    "them um you want to train in simulation but if you're doing the auto curriculum\n",
    "in the simulation design space it could start to Veer very easily and quickly\n",
    "into different corners or niches of the design space where um you know the\n",
    "parameters no longer really make sense in terms of mapping to a physical reality\n",
    "or a real world scenario that we as human users uh actually care about and so\n",
    "kind of it would be you know it would defeat the purpose of spending all this\n",
    "compute to train this model that could then help us in the real world because\n",
    "now it's veering off into parts of the design space that don't really matter for\n",
    "humans it's kind of noisy parts of the design space and so this kind of leads us\n",
    "to this question of grounding how do we ground curricula how do we align the\n",
    "curricula such that you know they can still do their exploration through this\n",
    "active sampling type of procedure over the environment design space but at the\n",
    "same still at the same time maintain at least some proximity to the parts of\n",
    "that design space that are relevant to what humans care about in terms of the\n",
    "actual tasks they represent I've been speaking with Kenneth Stanley a lot\n",
    "recently and we're talking about open-endedness and in general I've been trying\n",
    "to come at this problem from multiple angles and I've been using the lens of\n",
    "agency because I think agency is something that happens in the real world and\n",
    "that's why we have this Divergent process because we have multiple agents you\n",
    "know kind of like you know undirected following their own gradient of\n",
    "interestingness so in in evolution that's a great example of that it is this\n",
    "Divergent process but it's also grounded it's physically grounded you know so\n",
    "it's like the physical world creates some kind of constraints on on on the\n",
    "things that that are found and um I mean you know Clon called this AI generating\n",
    "algorithms there's quite a few different takes on this but the idea is that um\n",
    "to search this complex search space we we need to have a Divergent search and\n",
    "that's like we actually need to create the problems and the solution so like in\n",
    "the real world the the the you know the drafts had the problem of like eating\n",
    "the leaves from from from the trees and the problems and the solutions get\n",
    "generated in Tandem and this whole thing just kind of grows and grows and grows\n",
    "and that seems to be the most important feature that is missing in current AI\n",
    "systems and the grounding or the um Stanley calls it the gradient of\n",
    "interestingness I'm not sure whether You' agree with that but um I mean what\n",
    "Mark what what what what do you think about the importance of like this\n",
    "Divergence in in AI kind of the current Paradigm of machine learning of kind of\n",
    "like you know Gathering some data set beforehand or specifying some simulator\n",
    "beforehand if it's reinforcement learning is kind of good enough to do like a\n",
    "lot of reasonable tasks that we might care about um you know like obviously like\n",
    "predicting language or generating simulated language or performing very well at\n",
    "some simulated task and RL but it definitely seems like the next step towards\n",
    "like very general agents that are kind of you know I guess maybe I don't know if\n",
    "we want to use the term AGI but there something something more along the lines\n",
    "of a general agent that's kind of you know able to kind of self-improve and\n",
    "learn in more diverse environments um it definitely seems like that's kind of\n",
    "the next step of where machine learning will go and if we're going to get to\n",
    "that point I kind of agree with the idea that you know it certainly doesn't make\n",
    "sense to have some agent just randomly trying to gather completely random new\n",
    "knowledge like it certainly seems to make sense that you know you know even as a\n",
    "human to improve your intelligence you kind of selectively try and find out the\n",
    "areas in which like you can gather more more information or more knowledge and\n",
    "things like this and this is kind of what you know leads to this kind of I guess\n",
    "branching or you know like you said like the diverse set of things um that you\n",
    "might want to learn more about and so yeah I think like it clearly seems to make\n",
    "sense that like this kind of more openend this thinking is probably going to be\n",
    "like the next Paradigm of how we think about these kinds of systems but I I\n",
    "think M will had more to say about this I think the reason open-endedness is so\n",
    "interesting now is I think we're uh there's there's a few reasons why I think\n",
    "it's like newly relevant to this current ERA of machine learning because these\n",
    "ideas have been around for quite a while like um Ken Stanley Joel Layman um Jeff\n",
    "cloon uh Lis asaurus these a lot of these researchers they've they've been\n",
    "thinking about open-endedness and novel tbas search Divergent search for decades\n",
    "um I think it's really interesting to think about why there's sort of this\n",
    "Resurgence of these ideas now and I think a lot of it is because um it is again\n",
    "you know it's it's sort of following the same um sort of uh Tailwinds that have\n",
    "been driving a lot of the ml industry which is just like uh much better compute\n",
    "much larger data sets and I think what we're seeing now is that we know that\n",
    "modern deep learning methods work best when we can scale up the compute and the\n",
    "data that's how you get them to work um to to their Max small capabilities um at\n",
    "some point we're going to run out of data and a lot of people are now starting\n",
    "to talk about you know this as sort of a pending issue on the horizon which is\n",
    "you know at the current rate of consuming data for training our foundation\n",
    "models at some point we're going to run out of data we're where are we going to\n",
    "get the next trillion tokens from um and so I think a lot of this uh now points\n",
    "a lot of the interest to open-endedness because open-endedness is essentially\n",
    "you know we're studying systems that can generate their own data in an infinite\n",
    "uh capacity and so it's systems that essentially if you run it for longer and\n",
    "longer they get more and more complex they generate more and more quote unquote\n",
    "interestingness or interesting data um and so if we can actually you know crack\n",
    "this nut of how do we actually come up with a self-improving system in the sense\n",
    "that keeps generating interesting data uh we can then use that data to train\n",
    "further train our models but of course you get into this Perpetual uh data\n",
    "machine type of uh idea where obviously you know there's how do you generate\n",
    "more data uh if you know the data is ultimately from a model that you probably\n",
    "trained on previous data how do you get net new information from that well I\n",
    "think a lot of this is actually just resolved purely again going back to this\n",
    "idea of the reward function right or a preference function where there is\n",
    "outside information coming in through some sort of filtering criteria for\n",
    "example human designers in the loop uh or designers designing some sort of\n",
    "preference model that could essentially automatically rate the kinds of\n",
    "automatic uh data that's being generated by these open-ended systems and if we\n",
    "can do this kind of filtering we can essentially automatically find start to\n",
    "automatically find uh useful net new data net new trajectories net new even you\n",
    "know maybe um sentences like tokens or uh net new content to train our models on\n",
    "I've been thinking a lot about creativity recently and and I I think creativity\n",
    "is is is the other half of the coin of intelligence so in the world we live in I\n",
    "think that the intelligent process is is us we are a Divergent search and we are\n",
    "um basically tackling a complex search space and we are building knowledge and\n",
    "we we are mimetically sharing them in our society we're embedding them in our\n",
    "language and then language models come and acquire all of that knowledge so the\n",
    "cynical take is that AI today doesn't you know generalize and you it doesn't it\n",
    "doesn't creatively find new knowledge it just is a representation of the\n",
    "knowledge that we have found but it's not black and white is it so the work that\n",
    "you're doing is a great example of no no no you can generate new knowledge by\n",
    "exploring these complex search spaces and even though you're exploring existing\n",
    "models you're discovering interesting and novel combinations of those models\n",
    "that have not been found before so it's creating a novel margin on something\n",
    "that was not there before but I suppose the ideal future we want to get into is\n",
    "that we really can just from a far deeper level generate new knowledge yeah I\n",
    "think one interesting thing that I've been thinking about more recently you know\n",
    "is that um sort of the you know the high level question is just right now all of\n",
    "the state-of-the-art AI systems from chat gbt to stable diffusion style models\n",
    "for text image Generation all these systems they're they're amazing very\n",
    "impressive you know like 5 years ago I would not have believed that these\n",
    "systems could exist at this level of performance today but uh ultimately uh what\n",
    "they do is they're in the they're they're in the QA business so I basically ask\n",
    "these systems a question or I give them a command and they give me an answer um\n",
    "and so I think the next Frontier of AI is really how do we Design Systems that\n",
    "don't just answer questions but they actually are the ones that start to ask the\n",
    "questions and I think once we can have ai systems that start to ask interesting\n",
    "questions um that's when we start to get closer to I think traditional Notions\n",
    "of what uh strong AGI might be okay so so again really really interesting now so\n",
    "we're getting into agency and and people think that oh you could give a language\n",
    "model agency you just like you know run it in a loop and interesting things will\n",
    "happen well well it that's not true because the whole point of open-endedness is\n",
    "to prove that existing systems converge they don't diverge they don't accumulate\n",
    "information so we would need to create a kind of agent that like you know it\n",
    "would just keep running and it would just keep doing interesting and novel\n",
    "things it would keep accumulating information and I think that the reason why\n",
    "language models don't have agency is because they are essentially um a low\n",
    "entropy model and what what that means is during training a lot of the the sort\n",
    "of like the unnecessary um you know complexity was snipped off so the models\n",
    "only know about relevant things in the next step what's the next best token and\n",
    "it it feels like we would need to have not only a higher entropy search but we\n",
    "we would also need to have um a diverse set of models that are actively\n",
    "continually learning and and diverging from from each other but that's just my\n",
    "take I mean what do you guys think about that yeah I think that so I guess this\n",
    "relates quite a lot to this idea of like intrinsic motivation which is something\n",
    "that we utilize in our paper and I guess I guess the idea with that is like you\n",
    "know if we're trying to like gather new data in an environment like we shouldn't\n",
    "necessarily be constraint to just trying to new gather new data that's like good\n",
    "for a specific task um and so I I guess this kind of you know so intrinsic\n",
    "motivation basically says I should just gather new information because it's\n",
    "novel um and things like this and so we can basically like specifically try and\n",
    "gather information that you know reduces our uncertainty about the environment\n",
    "and and um or or similar objectives that that don't rely on some external reward\n",
    "signal and I think we when you get to the situation where the model is able to\n",
    "like self-improved in the absence of an external reward signal so intrinsic\n",
    "meaning that the the signal for what you should get is just purely generated by\n",
    "the model so it's purely intrinsic to the model um so I think the situation\n",
    "where you know you have the model that's able to self-improve without any\n",
    "external signal without a human having to Define what the reward is or what the\n",
    "objective is or this was good data this was bad data um I feel like that does\n",
    "feel like a lot closer to the notion of agency because of the fact you don't\n",
    "have kind of some external person defining what's good and what's bad and so\n",
    "yeah I think like this like and you also mentioned the word like creativity\n",
    "because I think at least in the context of things that I've done in terms of\n",
    "machine learning and reinforcement learning I think like intrinsic motivation\n",
    "feels like the closest thing related to creativity so you're basically like\n",
    "trying to gather information because it's novel or because you think it's or the\n",
    "model thinks it's interesting rather than because um you know it it satisfies\n",
    "some objective and so I think we could maybe say like intrinsic motivation is in\n",
    "some sense like an objective for being creative as well um I don't know if you\n",
    "have any thoughts about this yeah I think I think that uh it's I think there's\n",
    "definitely a a hugely deep connection between intrinsic motivation and uh\n",
    "creativity um in the literature intrinsic motivations also sometimes called\n",
    "artificial curiosity so this is a term that was coined by Jurgen Schmid Huber um\n",
    "could you could you explain it just what it is yeah yeah so oh yeah so taking a\n",
    "step back intrinsic motivation is essentially um in in reinforcement learning we\n",
    "train on reward signals and as Mark was saying um we typically train on external\n",
    "reward signal by external we mean that this is a task-based reward so this is um\n",
    "external in the sense that something outside of the agent that's learning like\n",
    "the human system designer decided that this is what the reward signal is for the\n",
    "task uh intrinsic means that we want to we don't design directly the reward\n",
    "signal but we're actually using some aspect of the model itself in order to\n",
    "drive the model's learning forward and so one example of this could be\n",
    "prediction error so if the model U has a large prediction error on a certain\n",
    "task like averaged over each time step we can use that as a reward signal and\n",
    "say hey you want to visit more parts of the environment where you're bad at\n",
    "predicting um how the state will transition when you act in that part of the\n",
    "environment and so uh as you can see this is very similar to maybe like intu\n",
    "Notions of what curiosity is uh curiosity and different forms of play um in the\n",
    "psychology literature A lot of people actually argue that you know different\n",
    "forms of play uh and curiosity really they they amount to you can model these\n",
    "behaviors as essentially a person trying to uh engage in activities where you\n",
    "know they're not very good at predicting the outcome and that's kind of what\n",
    "makes you could argue that's kind of what makes certain kinds of uh\n",
    "entertainment fun because or entertaining because you can't actually predict\n",
    "what will happen um you know in a few frames of the movie like like a movie\n",
    "wouldn't be very interesting or a book would not be very interesting if you can\n",
    "predict what will happen in the rest of the book just by reading the first few\n",
    "pages uh and so intrinsic motivation is really saying let's guide the model\n",
    "towards parts of the environment or the world or experiences where it's\n",
    "similarly unpredictable Stanley speaks about this this concept of deception or\n",
    "he calls it the false Compass which is this idea that any objective and and even\n",
    "you you could say exploring all of the search bace is an objective so he said\n",
    "every objective has deception and if you monotonically optimize any objective\n",
    "you will always lead into you know like a a deceptive part of the search page\n",
    "but then like the counterargument is say okay well let's let's not um let's not\n",
    "have any principles for doing the um you know the exploration let's just do\n",
    "something completely random and that doesn't seem very good so so then you know\n",
    "there's this concept of well how how do I how do I imuse some concept of what's\n",
    "interesting without falling victim to deception yeah so Ken Stanley uh has a\n",
    "famous essay in the realm of open endedness where he points out um that this\n",
    "notion of interestingness is uh ultimately a subjective concept and so even in\n",
    "the case of intrinsic motivation which I think is you know in practice we can\n",
    "get a lot of mileage out of this um and we've seen this in a lot of domains\n",
    "where uh exploration helps a lot like even in the Waker paper it's largely\n",
    "founded on this idea on how we exploit intrinsic motivation uh for learning\n",
    "World models but um ultimately you know these these modelbased uh measures of\n",
    "intrinsic motivation they are by definition based on the particular model at\n",
    "play and so um at some point you know you're you're starting to overfit to what\n",
    "that specific model finds interesting and of course what that model finds\n",
    "interesting if your measure of interestingness is something like a prediction\n",
    "error um is going to be a function of you know the specific architecture of the\n",
    "model the actual inductive biases of that model uh the capacity of that model to\n",
    "learn and so you could imagine a model where you know at the beginning it's\n",
    "looking for lots of interesting parts of a particular video game environment but\n",
    "at some point you know it might saturate what it can represent and what it can\n",
    "learn and at some point it might start to find things it's explored before\n",
    "interesting just because it's starting to forget those parts of the environment\n",
    "you know if you have like a very rich stream of different kinds of environments\n",
    "that it's exploring so ultimately this is like an example of deception because\n",
    "now it's like I I think that my model is the model thinks it's exploring parts\n",
    "of the environment that it finds interesting based on this prediction error But\n",
    "ultimately it might actually start to go back to other parts of the environment\n",
    "because of issues of model Capac capity and another really famous example of\n",
    "this issue would be like the noisy TV so like if your environment has you know\n",
    "this this um noisy TV where it's just showing random noise random RGB pixels um\n",
    "you know that's you know that's not something you can actually predict because\n",
    "it's just noise and so the model if your intrinsic motivation is really just to\n",
    "search for novelty in the form of prediction error it might just start staring\n",
    "at this TV forever because it's something that it just can't predict and it'll\n",
    "just by looking at that TV it'll be maximizing its prediction error yeah yeah\n",
    "it's so interesting um so so just coming into Rich suton a little bit so he had\n",
    "this idea called um reward is enough and and essentially that that's making the\n",
    "case that you know just using um implicit uh motivation all the stuff that that\n",
    "you've just been speaking about using this trajectory um you know optimization\n",
    "process that we can do everything we need to do and in in your paper you're kind\n",
    "of making an argument similar to what lcon has been making for years about\n",
    "self-supervised image learning that what we should do guys is let's let's kind\n",
    "of pre-train a base model so this model um understands environmental Dynamics\n",
    "really well and then we stick a reward in there and we build um agents after\n",
    "that so does it in any way reenforce or pun intended uh Sutton or or or or do\n",
    "you think it's still complimentary I think it's still complimentary at least if\n",
    "I understand the the meaning of the reward is enough paper because my\n",
    "understanding of that um line of thought is basically saying that you know we\n",
    "can kind of specify you know any task that we might might want an intelligent\n",
    "agent to do as optimizing a reward in some like mdp or prdp so Mark of decision\n",
    "process or something like that and I think our work isn't contrary to that in\n",
    "the sense of like you know I I do think that that probably is a sufficient\n",
    "framework to be able to model any any kind of behavior that we might want an\n",
    "agent to do but I think when it comes to actually like practically implementing\n",
    "that idea there's a lot of difficulties so the first one might be um you know\n",
    "how do we even specify that reward function so you know if the reward function\n",
    "is to um have a good life or something like this like there's obviously like you\n",
    "know maybe there is some like numerical way of defining that in terms of an mdp\n",
    "but there's like not actually a good way of of writing down that function that\n",
    "Maps what I do to whether I'm getting good rewards and so I think there's this\n",
    "kind of like you know I think that's a good framework for like thinking about\n",
    "any problem but then you have these kind of like practical issues of how do you\n",
    "actually Define rewards and how do you how do you say like whether an agent's\n",
    "doing well or not doing well and things like this um and so I think that's still\n",
    "um even with the world models lines of work I think that's still like kind of\n",
    "quite a difficult issue so so so the world models lines of work kind of you know\n",
    "allow you to model you know predicting a in the environment which is a very\n",
    "useful thing for doing a lot of tasks um but then if you actually want to\n",
    "optimize some specific task you still have this problem of like how do you\n",
    "define the reward and so we eventually want to get to this point of being able\n",
    "to like inject a reward into the world model so we're kind of in agreement with\n",
    "that kind of line of thinking in the sense we're eventually going to use a\n",
    "reward to derive the the the desired intelligent Behavior so I don't think\n",
    "there's any conflict in that sense but we still have this kind of problem of how\n",
    "do we inject that reward into the the world model how do we Define what that\n",
    "reward should be um and the case of um you know one of the easiest things to do\n",
    "for example would just be to label each image with a reward and then you can\n",
    "kind of encode that image into the latent space of the world model and then use\n",
    "that to Define how good a certain thing is and that's kind of the style of\n",
    "thinking we think of on our work um but I don't think that overcomes this like\n",
    "overarching issue of in general it's you know rewards can Define everything but\n",
    "how do you in practice like get that function is pretty hard yeah yeah I mean in\n",
    "a sense reward is enough is sort of a tonology because once you know the reward\n",
    "um if you know the the reward function for your environment you can essentially\n",
    "compute the value function which gives you the optimal policy um and so reward\n",
    "has to be enough if you know the reward function and so uh I think the more\n",
    "interesting question is definitely like what is enough for the reward what is\n",
    "enough to actually have a system automatically figure out what are interesting\n",
    "new rewards for us to train new agents or new models on or continue training\n",
    "existing models on um and I think this goes back to the question of environment\n",
    "Design This is largely the motivation of that line of work this autoc curricular\n",
    "environment design where essentially if we can automatically weave through this\n",
    "path of possible environments of the design space of the environments the design\n",
    "space uh clearly will Encompass like a big part of the design space is also\n",
    "encompassing the reward for those tasks and so essentially we want to find a\n",
    "curriculum automatic curriculum or path through the possible reward functions in\n",
    "which we can start to train a more and more General agent but then the\n",
    "interesting question is again like what exactly is the right notion of\n",
    "interestingness in order to drive that curriculum that path through the design\n",
    "space of possible things we could be training our model or agent on and um and\n",
    "that's I think one of the most interesting open questions and it relates to the\n",
    "question as well of how do we get the model to ask the questions um because\n",
    "really what drives humans in terms of asking further questions uh is our own\n",
    "implicit notion of interestingness which is informed by things like the\n",
    "scientific method and you know being able to create explanations about the world\n",
    "and we find things interesting when we can't actually explain some phenomenon\n",
    "about the world uh B based on existing theories or explanations and so I think\n",
    "what's really missing for a well-grounded you know human interpretable version\n",
    "of interestingness is having models that can essentially come up with their own\n",
    "theories about the world and start to probe those theories for where there's\n",
    "mismatch between you know the their learned theory of the world and evidence\n",
    "that new evidence that they find from experiences in the world yeah it's so\n",
    "interesting and and um I mean when I make the argument that agent should be\n",
    "physically and socially embedded it's it's actually quite a simple argument\n",
    "which is just the guard it it's that interestingness thing I think that that is\n",
    "how you know having um agency but with the guard rails of our physical and\n",
    "social embedding so you know we're we're sampling things that make sense because\n",
    "they're already there that you know but but but obviously we can go off piece a\n",
    "little bit as individual agents I I feel that that that's what helps that\n",
    "process just coming back to suton it's entirely possible that I've misunderstood\n",
    "suton by the way so my my interpretation of of reward is enough and it might be\n",
    "true as you say that it's tautological given that if you already knew the reward\n",
    "function for particular environment then it could do everything that it needed\n",
    "to do but my interpretation of of reward is enough is that it would lead to um a\n",
    "general intelligence and you know General in the kind of magical sense that it\n",
    "would work in in any possible situation but if it is specialized in the way that\n",
    "we agreed earlier that there exists a a reward function which would in you know\n",
    "codify motifs and things that you know you need to know or optimize in a\n",
    "particular environment or set of environments then to me that's still\n",
    "specialized intelligence and I would great yeah yeah that's that I think that\n",
    "aligns with my take as well where I think if you have a reward function um it's\n",
    "already sort of applying uh largely applies to at least the examples in that\n",
    "position paper about reward is enough it seems like most of the reward functions\n",
    "they discussed are largely um grounded in a specific task and I think that if\n",
    "you have the reward function for a specific task then it definitely seems that\n",
    "you can have some optimization or learning algorithm that essentially learns to\n",
    "optimize that reward and therefore achieve that task um so I do think sort of\n",
    "the open question that uh it I think saying reward is enough I think it kind of\n",
    "passes the buck up further one level to the question of where that reward comes\n",
    "from and I do think that having systems that can automatically design\n",
    "interesting new rewards that seems like the frontier yeah I I agree and and you\n",
    "know to me intelligence is about discovering the knowledge and the knowledge is\n",
    "the reward function so it feels like kind of baking the knowledge in into the\n",
    "system um okay so another sort of Galaxy brain take is um I was talking to\n",
    "Bishop about this the other day and um do you think of like deep learning models\n",
    "as one model or do you think of them as a sort of like intrinsic Ensemble of\n",
    "models because they they get you they behave differently in an input sensitive\n",
    "way so you know like depending on the prompt you put into language into a\n",
    "language model you might find that like a different part of the weight space\n",
    "gets activated and essentially it's like retrieving a mini program and that\n",
    "program is being run but it's not it's not model building it's like model\n",
    "retrieving but would you agree with that H I guess I'm not sure about the like\n",
    "like within like subsets of a single homogeneous model but I guess the thing\n",
    "that I like to think about that's I think quite related to this is this idea of\n",
    "like and I think Yan Lon also kind of well a lot of people have laid out like a\n",
    "similar architecture as like you know should we think of intelligent agents as\n",
    "having kind of like separate subsystems that can maybe like be thought of as\n",
    "different your networks and so you know we could have like you know um the\n",
    "standard notion of a policy which is like outputting actions and maybe we also\n",
    "want to have the notion of like a prediction model more like a world model that\n",
    "predicts what might go ahead in the world as well as maybe like a planner that\n",
    "is somehow good at like optimizing in that model and so we could kind of think\n",
    "of all these things as like separate subcomponents that we assume an intelligent\n",
    "you know an intelligent thing would have like an intelligent thing should be\n",
    "able to predict ahead on the world it should also be able to Output actions it\n",
    "should hopefully maybe be able to infer like why other things happened and\n",
    "things like this and so I guess as to whether we think that should you know be\n",
    "just like one homogeneous model um for which maybe you query it and maybe you\n",
    "know different aspects of that model would kind of um you know handle different\n",
    "aspects of the query or that we should think of those as separate components I'm\n",
    "not really sure as to whether it matters whether they're separate components or\n",
    "not because yeah I agree that you probably could just have like one massive\n",
    "model that does all of these things and I think at least from the the trend that\n",
    "I've been seeing um in kind of the world models literature and and also just\n",
    "like I guess the RL lit or maybe just we should call it the foundation model\n",
    "literature is you kind of don't want to have like a a separate model that does\n",
    "the prediction for actions and a separate model that does the prediction of\n",
    "observations like why not just have one massive model that's jointly trained to\n",
    "predict everything you might want to query and then depending on the different\n",
    "query you know it will just either predict an action or predict a video sequence\n",
    "or it can be condition on actions or condition on language so I think in this\n",
    "sense like this kind of model like you said is more like just one massive model\n",
    "but it kind of has like lots of different subtasks that it's able to do um and\n",
    "so maybe this is actually like the more effective way of training a model\n",
    "because then you kind of get generalization across these different subtasks as\n",
    "well yeah and the reason I'm asking the question is um it seemed I mean like you\n",
    "know for for an outside are coming in it looks like statistics is broken you\n",
    "know in the olden days we used to talk about the my fre lunch theorem used to\n",
    "say like you know you need to have specialized models for different situations\n",
    "and now the narrative is that we have generalist models we have Foundation\n",
    "models and and they are better than the specialized models in a strong sense and\n",
    "you know and I like to sort of push on this a little bit and see well when when\n",
    "does it break because we know that there are like these physics inspired models\n",
    "with inductive priors that you know know know about invariance of you know like\n",
    "molecules in drug Discovery and stuff like that and surely they would be better\n",
    "than a language model but no no no now they're training language models on\n",
    "mathematical conjecturing and like you know like um drug formula using tokens\n",
    "and so on so you know as an outsider you might just think well we can just use a\n",
    "big transformers model for everything I I think a lot of this does come from um\n",
    "well so I think the attention-based Transformer architecture is proven\n",
    "empirically to just be highly scalable highly effective at learning lots of\n",
    "different kinds of data distributions um but I think also part of it is just\n",
    "that we're just starting to enter this regime where we're just training these\n",
    "models on an insanely large amount of data and I think that a lot of times we\n",
    "need to sort of take a step back and really consider the amazing performances on\n",
    "different tasks and really think about you know how much information was\n",
    "actually leaked into uh this task in the training data because um right now\n",
    "we're really just training uh these huge models on I think I would say that\n",
    "we're largely training them on the test distribution in many cases um I do there\n",
    "I have seen like lots of examples of uh truly impressive behaviors from these\n",
    "models that that do seem like uh truly novel like zero shot General ization to\n",
    "unseen tasks uh like there was a recent example I saw on Twitter where someone\n",
    "uh had like a very low resource like rare language and they gave it a few gave\n",
    "they gave I think the cloud 3 Model A few examples and it was able to\n",
    "essentially perfectly reproduce uh new utterances in that language uh so that\n",
    "does seem very impressive um but it does seem at the same time you know a lot of\n",
    "the performances for example on elsat or like AP biology exams I imagine a lot\n",
    "of that is really a function of just uh literally giving the model uh the test\n",
    "domain in terms of information during the training step okay okay so there are\n",
    "like two schools of thought on this when we talk about world models you know\n",
    "people are talking about Sor and is it building a world model and and it\n",
    "certainly seems to be it seems to be doing I mean obviously it's not doing Navia\n",
    "Stokes it's not doing like fluid dynamics but but it seems to be doing something\n",
    "like that so like one one extreme view is that it it is just a hash table and\n",
    "you know it's it's kind of doing some diffused approximate retrieval or whatever\n",
    "another school of thought is that it's like a simulator and you know people talk\n",
    "about the simulator's view of large language models and you know like it's like\n",
    "it's modeling not only you know just just just the words and the language but\n",
    "it's also implicitly learned to model the world and the people and and all of us\n",
    "so that's the Spectrum I mean like Mark where where do you think these things\n",
    "are on that Spectrum yeah I think we like it would be great to be able to play\n",
    "around with it and kind of see what we can get out of it but I think I think if\n",
    "you can for example you know after each kind of you know so it's a language\n",
    "condition model so if after each kind of frame you could for example put in a\n",
    "different langu um language kind of conditioning and say like you know what\n",
    "happens here if you know the mug was pushed off the table instead of whatever\n",
    "else was originally happening in the video and so if you can basically do this\n",
    "kind of like counterfactual or like Interventional predictions where you kind of\n",
    "give some new action and then you're able to see like the alternative outcome of\n",
    "that new action I think if the model's able to do that then I would think that\n",
    "it does have a pretty good understanding of how the world works in the sense of\n",
    "you know I really think like if you can predict the outcome of any action given\n",
    "some sequence of observations I do think that's a pretty good proxy for being to\n",
    "say if you can do that you really do understand how the world Works um and so I\n",
    "think if the model can do that I I would be kind of inclined to say that it does\n",
    "have like a kind of world model in the sense of understanding the underlying\n",
    "world but then there might also be a chance that you know you know these models\n",
    "aren't like you said it's more just like a diffuse retrieval and and perhaps if\n",
    "you try and do like a very fine grain conditioning on a slightly different\n",
    "outcome um different like conditioning maybe it won't actually give you the\n",
    "correct kind of counterfactual prediction and so I think maybe we'd have to see\n",
    "how good these models are at generalizing to to slightly different inputs and\n",
    "things like that to really see if it understands things well or it is just like\n",
    "kind of generating some arbitrary video yeah I think it's a double whammy\n",
    "because our colloquial use of language and like you know use of models and\n",
    "intelligence is so static that like you know we we we um we think of of that as\n",
    "being intelligence but but we're still going like we we're now create we're\n",
    "creating knowledge right now we're creating models because we're exploring we're\n",
    "doing exactly what you said mchi we're like we're exploring the search space and\n",
    "we're building models and we're combining them together and you know presumably\n",
    "would diverge quite quickly from from from the language models but I mean what\n",
    "what's your take on on this idea that they are you know potentially World\n",
    "simulators yeah um so just regarding the the sort of lookup analogy for these\n",
    "large models I I think it's so my mental model is similar to that um although I\n",
    "think it's it's very close to um I think a really good write up of of the of\n",
    "this alternative take um which is more like there's an alternative take which is\n",
    "that it is kind of like a lookup table but the prompt itself is a key that maps\n",
    "not to a specific sort of response but to potentially like a function aast space\n",
    "of functions and fr had a really good um sort of blog post where he kind of goes\n",
    "more into the details of this viewpoint but I think that that really you know\n",
    "resonates with my intuition of how these things behave where it's not literally\n",
    "looking up like um a key value in a hash table it it seems more like it's these\n",
    "models have learned over tremendous amounts of data to compress that data they\n",
    "have to learn um I think more abstract functions that helped to explain that\n",
    "data and therefore they're learning functions so they're approximating some kind\n",
    "of function uh or a vast family of functions and I think the prompt really acts\n",
    "like as a key that essentially activates a particular function and so you can\n",
    "kind of think of you know in the classical world where one neural network equals\n",
    "one function like basically it's mapping from images to image net labels now\n",
    "like Foundation model in the foundation model regime it's like one Foundation\n",
    "model is essentially kind of like a giant database of lots and lots of lot\n",
    "different functions um that's basically activated selectively based on the input\n",
    "or the prompt um and I I do think that you know based on this I think it's\n",
    "definitely possible that with enough data from the world enough experiential\n",
    "data that these Foundation models can learn sort of a basis set of Dynamics and\n",
    "transitions that explain how the world Works um and essentially if it does learn\n",
    "these transitions um for example in like the massive amount of video data that\n",
    "sore is trained on um I would say that yeah I I would agree that they are\n",
    "essentially starting to approximate uh World models sure yeah so yeah these are\n",
    "two um separate papers so so the first one being dreamer led by like Dan and jar\n",
    "half so this is um you know example of work in the space of world models and so\n",
    "basically what dreamer involves doing is like a way of training a world model\n",
    "and then also showing that you can just generate synthetic data in this mod\n",
    "model and then optimize decision- making like purely using the synthetic data um\n",
    "so we talked a little bit earlier about like partially observable mdps so we\n",
    "want to like take kind of the sequence of observations um and then be able to\n",
    "predict like the next a distribution over the next observation given some action\n",
    "and so so we also talked about how you might want to like compress this into\n",
    "like a um more compressed representation of of the previous observation so\n",
    "basically what dreamer proposes to do and a lot of works on world modeling is to\n",
    "take your previous sequence of observations and then you map them to some\n",
    "compressed representation and then could predict ahead in this latent space um\n",
    "the next uh latent um latent State condition on the action and then yeah the\n",
    "really interesting about this is that now um you know we can in general predict\n",
    "what's going to happen to condition on different actions so now if you want to\n",
    "get like interesting Behavior out of something like dreamer you can then go\n",
    "ahead and generate a lot of synthetic data using dreamer or the dreamer World\n",
    "model and then use that to optimize behavior and so in Dreer basically the way\n",
    "it's done is by doing like on policy reinforcement learning in the world model\n",
    "so a lot of people call this like reinforcement learning in imagination so it's\n",
    "basically you know you're imaginating a bunch of synthetic data then using that\n",
    "to like use some standard reinforcement learning algorithm and then optimize um\n",
    "behavior in some sense um and then you could also do other things like Monti\n",
    "research which is like closer to like the works on on muso and things like this\n",
    "creativity is a little bit like a cloud and all the creativity only happens on\n",
    "the surface of the cloud so there's this interesting thing that like Creative\n",
    "Discovery depends on the history of all the things that are discovered before\n",
    "and typically like new discovery only happens at the end of the chain not back\n",
    "in in the middle tinkering exactly and and there's also this notion that\n",
    "creativity happens through knowledge so like knowled new knowledge doesn't come\n",
    "from The Ether it's kind of there's some creative component to it but it it's\n",
    "it's on the um the the trodden path of existing knowledge that we already have\n",
    "yeah that wasn't a very good question but so when when we talk about imagination\n",
    "through like you know like reinforcement learning policies and so on what we're\n",
    "saying is like you know I'm I'm imagining all of these like possible you know\n",
    "worlds and so on but I'm using the cognitive Primitives of all of the stuff that\n",
    "I already know yeah I think knowledge is definitely U compounding uh compounding\n",
    "um artifact uh that's basically like the culmination of everything all the\n",
    "experiences that we uh that we encounter like throughout our whole life and\n",
    "through also like Beyond you know going backwards Beyond like even our\n",
    "individual lives into like the cultural knowledge that's shared and uh what's\n",
    "really cool about language models is that they are essentially um a codification\n",
    "of cultural knowledge and so uh Jeff cloon has this concept of AI generating Ai\n",
    "and so he's got multiple pillars of essentially what it takes for uh you to have\n",
    "ai systems generate General AI systems and he recently added actually like as a\n",
    "fundamental piece of this in in his framework uh this idea of building on top of\n",
    "foundation models and so he says he calls it like standing on the shoulders of\n",
    "giant Foundation models um which is I think really just um sort of the ml\n",
    "equivalent of building on top of cultural knowledge there's there's a real shift\n",
    "recently towards talking about um synthetic data and as we were just saying like\n",
    "you know synthetic data doesn't come from the epha so we already know stuff\n",
    "about the world we we build simulators and we kind of generate new information\n",
    "but in the neighborhood of things that we already know and then we kind of like\n",
    "iterate and fine-tune on the generated data um what what what do you think about\n",
    "that process yeah no I think yeah maybe I'll bring it back to this like the plan\n",
    "to explore line of work so yeah um so so basically like the motivation of that\n",
    "kind of work is like kind of saying you know we might have some like previous\n",
    "data set or something and we've trained our world model on that data set but we\n",
    "really want to go out and like gather more data and then like improve the world\n",
    "model um um by gathering more data and so we can use things like intrinsic\n",
    "motivation to then give us like a reward signal within the world model so in the\n",
    "sense of something like prediction error which me mentioned earlier so now we\n",
    "can basically like train a policy in the world model that's now not trained for\n",
    "a specific task but it's trained to go out and gather information in the world\n",
    "so basically now you know you do this imagining in the world model to imagine\n",
    "aead but instead of imagining ahead how do I do a task well you're imagining\n",
    "ahead how do I get to states that I don't know what happens and therefore will\n",
    "learn more and that's basically like the motivation behind plan to explore um\n",
    "and then in our um paper Waker it's it's kind of like inspired by plan to\n",
    "explore as well as works on like Auto curricular and so basically what we're\n",
    "trying to say is you know plan to explore is good for for getting an agent to go\n",
    "out and gather data um within a single environment and you know and presumably\n",
    "once you've gathered enough data within a single environment then you can\n",
    "generate a bunch of synthetic data in that single environment and then do what\n",
    "we discussed with dreamer in terms of like optimizing a policy for that very\n",
    "specific environment um but what we're really interested in is saying you know\n",
    "let's not assume that we have like one specific environment beforehand let's\n",
    "assume that you know there's some space of you know broad range of scenarios\n",
    "like we want a very like General agent there might be a bunch of different\n",
    "environments and then within that those different environments we kind of want\n",
    "to be able to to handle absolutely any task and so in the Waker paper we're\n",
    "basically saying like you know how should we gather the data within um within\n",
    "this like broad space of possible environments and tasks such that we can train\n",
    "a very good World model and then once we have that world model that's kind of\n",
    "like capable across environments and tasks you know the assumption is that we\n",
    "can then use that to generate good synthetic data which we can then um use to\n",
    "optimize behavior and so maybe to talk a little bit about like how we formalize\n",
    "this problem um so you know we mentioned earlier this idea of like the\n",
    "simulation Lemma so we basically say that or an existing work that says like in\n",
    "a single environment we can bound the gap between the optimal policy that's\n",
    "trained in the world model so trained in the synthetic data to the to the truly\n",
    "optimal policy by the error in the world model and the distribution of States\n",
    "generated by that policy so it's kind of intuitive like the world model should\n",
    "have you know low error and then we will get a good policy out of it but then\n",
    "what we're trying to say is like now let's assume we don't know what the\n",
    "environment is beforehand and we also don't know what the task is beforehand so\n",
    "how do we get like a good World model that can handle like all of those\n",
    "situations when we later want to go ahead and optimize some task um and so the\n",
    "way that we do this is we basically yeah we then use this notion of min max\n",
    "regret to say that the policy should have like low maximum regret across this\n",
    "entire space of environments and then using the simulation Lim we can basically\n",
    "say now now the um the world model has to have low error across all environments\n",
    "under the distribution of States generated by the optimal policy for any future\n",
    "task um so we're going to say like yeah the world model has to be good for any\n",
    "environment and under you know in any area that the policy might go to that's\n",
    "relevant to the Future tasks and then what we kind of say in the paper is you\n",
    "know if we want a truly General agent we're not going to know what the\n",
    "distribution of tasks is beforehand so we don't know we don't know what the\n",
    "reward function is we don't have a set of reward functions um you know we're\n",
    "just going to kind of assume the agent has to do anything later down the line\n",
    "and this is kind of like related to this idea of like open-endedness that we've\n",
    "talked a lot about and so if we don't know what the task is going to be like\n",
    "later down the line um then the best assumption we can do is say that you know\n",
    "it could be any reward function later down the line um which is maybe not the\n",
    "best assumption because as we talked a bit earlier if you're just kind of you\n",
    "know we talked about a bit about intrinsic motivation and interestingness and if\n",
    "you kind of assume the task can be absolutely anything later down the line\n",
    "you're kind of assuming that you know the agent might want to do something\n",
    "completely ridiculous later like it if you do this in robotics that might mean\n",
    "the task is just to do like back flips later or something like that but you have\n",
    "no interest in doing that so it's it's not clear if that's really a good\n",
    "assumption about how we should think about what tasks might be interesting later\n",
    "but that's the Assumption we make so we assume the task can be absolutely\n",
    "anything later down the line um so so now we have to get a to the point where we\n",
    "have the world model which is good for any environment and under the\n",
    "distribution of States generated for any task or any optimal reward function um\n",
    "and to do this we basically like Leverage two different techniques so to\n",
    "generate this state um so to handle the aspect that we don't know what the task\n",
    "is later down the line we assumed that um we have an intrinsically motivated\n",
    "policy that's basically seeking out the maximum uncertainty in any single\n",
    "environment and so basically if if this um if this intrinsically motivated\n",
    "policy is seeking out the maximum certainty in every environment um it's kind of\n",
    "like estimating for us what the maximum uncertainty is in every environment\n",
    "because it's like actively finding uncertainty in every environment so now we\n",
    "have a policy that's finding like the maximum uncertainty in every environment\n",
    "and then if we want to um optimize this like Minimax Criterion across\n",
    "environments we kind of need the maximum uncertainty to be low across all\n",
    "environments so so we kind of have to have like um you know this policy isn't\n",
    "able to find like lots of big errors across all different environments um and so\n",
    "basically you know what we could think might might might what happened in\n",
    "practice is you know you could imagine there are a bunch of different\n",
    "environments some which are like low complexity and some of which are high\n",
    "complexity and if we just kind of naively sampled from those two different\n",
    "environments data you know our world model is going to very quickly get good at\n",
    "the low complexity environment and then it's going to leave lot more data from\n",
    "that high complexity environment to eventually get the errors low in the high\n",
    "complexity environment so to bring it back to the title of the paper which is\n",
    "weighted acquisition of knowledge across environments for robustness so the idea\n",
    "here is that we're basically going to change how we sample that distribution of\n",
    "data across environments to make sure that maximum uncertainty stays low across\n",
    "environments so what this ends up looking like is you know we're going to sample\n",
    "less data from the environment that has lower complexity and then we're going to\n",
    "actively sample more data from the environment that has higher complexity such\n",
    "that we we bring those errors down on the higher complexity environments and I\n",
    "guess this is a little bit different to existing works on curricular because\n",
    "normally in curricular like automatic curriculum learning you kind of assume\n",
    "that you have some reward function which is telling you how well the policy is\n",
    "doing in each environment and you use use that specific like metric of how well\n",
    "the policy is doing to determine um you know where the policy has more potential\n",
    "to learn but because we're making this assumption that you know we don't know\n",
    "what the reward function is we're we're trying to get a general agent that can\n",
    "kind of do any task any reward function um we don't assume that we know that\n",
    "reward function beforehand so we can't use reward as a metric of saying like I\n",
    "need more data from here or I need more data from here but then kind of the main\n",
    "argument of the paper is showing that you know if if we just think about this in\n",
    "terms of prediction era in the world model like we can actually use that as like\n",
    "an intrinsic motivation signal to say you know does the agent need to gather\n",
    "more data from this environment or from this environment without access to a\n",
    "reward function and so we could kind of think of um this work as kind of a more\n",
    "General approach to automatic curriculum learning in the sense of like we're not\n",
    "assuming that that you have a reward function beforehand we're kind of agnostic\n",
    "to what the task is um and because and to kind of distill that knowledge that's\n",
    "that's gathered without the reward function we use World model as a mechanism to\n",
    "like distill that knowledge because if you just like naively have an agent\n",
    "gathering information with a reward function um you know how do you how do you\n",
    "kind of put that knowledge into the agent and we kind of argue the best way of\n",
    "doing that is the world model um so that's kind of a summary of like the Waker\n",
    "paper and like what the ultimate algorithm ends up doing so I mean essentially\n",
    "you're doing a high entropy search so you're you're leaning into um areas of\n",
    "complexity and you're building a higher complexity model which goes against the\n",
    "grain of of the intuition of like oam's Razer that we should have simple models\n",
    "so you're you're almost deliberately saying no I I want I want to model the the\n",
    "complexity and and have more of that and then the other interesting thing is\n",
    "like from from a a curriculum learning point of view I think traditionally we\n",
    "did explicit curriculum learning and you know we might have some principles\n",
    "around having a monotonically increasing curriculum of complexity whereas here\n",
    "by leaning into um environments where we do worse on so we're selecting them\n",
    "based on prediction error we're actually implicitly getting a kind of\n",
    "monotonically increasing complexity which just happens to work really well yeah\n",
    "I I guess actually it actually almost ends up being in the opposite direction so\n",
    "so by leaning into the the the higher complexity environments more we're kind of\n",
    "saying let's prioritize the harder environments more to begin with so let's like\n",
    "gather more data in in the higher complexity environments um you know cuz I\n",
    "guess in intuitively if you kind of want to be good across all environments you\n",
    "kind of need more data from the higher complexity environments and we don't\n",
    "really explicitly think about an ordering of going first from easy to hard um I\n",
    "guess that maybe there is a good something to look into there because you know\n",
    "like a lot of these Works go from low complexity to high complexity because it's\n",
    "kind of easier to learn an initial policy that can kind of do something in the\n",
    "low complexity environment and then you build up the complexity gradually um but\n",
    "I think that that idea is most useful when you know what the task is so you\n",
    "could imagine if the task is like Locomotion if it's walking you kind of want to\n",
    "First loan a policy that's able to walk on flat ground and then maybe gradually\n",
    "build up the complexity like add and bumps and then eventually it can walk on\n",
    "like very complicated terrain so it kind of makes sense to go from low to high\n",
    "complexity um but in this work we're focusing on Purely intrinsic motivation\n",
    "meaning that the policy is not trying to learn a specific task it's trying to\n",
    "just seek out um uncertainty and like reduce uncertainty and so we don't really\n",
    "have the the notion of you know you first need to be able to learn how to do\n",
    "something on an easy an easy environment and then towards harder environments\n",
    "because there is no specific task that we're trying to learn and so I think for\n",
    "this reason you know we wouldn't didn't really focus on this notion of moving\n",
    "from easier to hard environments that actually you know we're consistently\n",
    "samping more data from the hard environments and I guess I think this relates or\n",
    "I think this is something that you brought up when we when we worked on this is\n",
    "like you know I think we can really relate this idea to like a lot of different\n",
    "contexts including things like like language models for example um so you know\n",
    "you could imagine if I'm training in LM I don't really necessarily have this you\n",
    "know not really a reward function in some sense you're just trying to do like\n",
    "unsupervised prediction um and so you know we could for example take the\n",
    "prediction era of like a language model in a in a bunch of different domains and\n",
    "say you know the language model is not very good at predicting a language about\n",
    "some certain task or something like that and and you know we could say you know\n",
    "and intuitively the same thing kind of holds if it's not very good at predicting\n",
    "you know um what the next token is in French like we should presumably gather\n",
    "more data in French um and so that kind of gives us a way of like actively\n",
    "Gathering the appropriate data um and so yeah I think this idea of like\n",
    "Gathering more datab based uncertainty obviously is a very general idea like the\n",
    "idea of like Active Learning um but we kind of like specialize that into\n",
    "thinking about how do we think about this in terms of the reinforcement learning\n",
    "setting it might be um interesting to talk about as well like sort of because we\n",
    "looked at some of the metrics as well right the environment complexity metrics\n",
    "we don't have the external notion of difficulty but we we also did look at sort\n",
    "of the emergent uh curriculum yeah yeah yeah gotcha yeah so I guess um so it\n",
    "kind of dependent on the environment so in some environments you just kind of\n",
    "got this like very straightforward behavior of like you know consistently gather\n",
    "more data in the more complex environment um but because we're we're actively\n",
    "trying to gather data um of the the environments for which the uncertainty is\n",
    "the highest kind of this curriculum could change o over over the course of\n",
    "training so so what happened in some of the other environments for example is\n",
    "that initially all the environments are just like high un certainty like there's\n",
    "like all environments are kind of misunderstood therefore like sample all\n",
    "environments like equally more or less to just get a rough understanding uh and\n",
    "then you know as as the model would improve on the simplest environments then we\n",
    "would see like more and more emphasis towards sampling the highest complexity\n",
    "environments so I guess in that sense we would get something to more like kind\n",
    "of what you said in terms of like a standard curriculum but but a bit different\n",
    "in the sense of like initially everything is uncertain so we're just going to\n",
    "sample everything uniformly um but then we kind of get a better understanding of\n",
    "which of the environments you know the uncertainty remains high on these higher\n",
    "complexity ones and those are the ones we need to like go and gather more data\n",
    "yeah I mean I I can see this both ways I mean certainly from a like a basian\n",
    "optimization point of view that there's something to be said for um you know\n",
    "this is where I'm uncertain go and gather more data where where I have highest\n",
    "uncertainty and as you say like traditionally in curriculum learning we are told\n",
    "that we need to have monotonic increasing complexity but as you just said that's\n",
    "when we have a particular task in mind now neuron networks they're a little bit\n",
    "like a block of clay aren't they so you know it's it starts off with abject\n",
    "complexity and then we do stand you know we do um stochastic gradient descent\n",
    "and we chip away at the clay and we kind of build we sculp a statue that that\n",
    "that we want to build and I'm just trying to get an intuition here so like with\n",
    "with this maximum entropy um search you know like high entropy search what we're\n",
    "doing is is we're saying okay well here are some complex models and these models\n",
    "must contain motifs that tell us a lot of information it's a little bit like the\n",
    "ELO algorithm in chess you know you actually get Information Gain when something\n",
    "surprising happened so here's a big um block of complexity and I'm going to try\n",
    "and infer what the motifs are in that complexity that that explain the\n",
    "information that I'm missing I think that um a lot of this ultimately traces\n",
    "back to sort of there's like this like fundamental pattern towards uh I think\n",
    "that like ties a lot of these ideas around active um active experiment design or\n",
    "like active sampling which is and all all these autoc curricular methods which\n",
    "is you essentially want to devise uh what you know nowadays we call a\n",
    "self-supervised objective or self self-supervised training algorithm um where\n",
    "essentially you have the system essentially use signals it produces itself um\n",
    "during the training or evaluation process in order to drive itself forward in\n",
    "terms of deciding what future data to train on and so you know we sometimes call\n",
    "these kinds of systems Auto curricula as well because it's automatically\n",
    "generating this curriculum of tasks to train on and I think the sort of like the\n",
    "fundamental connecting um uh pattern here is just that this the signal that we\n",
    "use to drive the training it's always going to be based on something like uh an\n",
    "uncertainty signal or um going back to the open-endedness literature something\n",
    "like a classic notion of interestingness and I think there's just a lot of\n",
    "different possible choices for this metric and so one for example we talked a\n",
    "lot about Mini Max grab so regret could be one of these driving signals because\n",
    "it measures the existence of a performance Gap and therefore probably an\n",
    "information Gap as well in terms of learning to master those tasks with high\n",
    "regret um but also uncertainty is also another one it ties back to novelty\n",
    "because novel environments you will be more uncertain within and so there's\n",
    "fundamentally lots of different sort of branches of these autoc curricula that\n",
    "you could use depending on this search objective that you use to drive this\n",
    "exploration process can we contrast this to you know like large language models\n",
    "they are self-supervised learning so you know we we do this self-supervised\n",
    "objective you know which is like you know typically predicting the next word and\n",
    "it's a similar thing with um self-supervised um image um learning now the\n",
    "difference is with that is you're talking about a principled way of you know\n",
    "seeking specific information you know with um let's say high high entropy and\n",
    "that would lead to an imp curricular whereas with language modeling language\n",
    "modeling there is no implicit curricular but I might argue that there kind of is\n",
    "because the way the model does this continual learning um it might regularize\n",
    "itself so if you give it sort of surprising and weird information the language\n",
    "model might just kind of brush it off and if you reinforce things that it\n",
    "already knows then it's almost like a a stream of channels you know it'll say\n",
    "okay you know go go and go and and pay attention to that so it's almost like\n",
    "it's implicit yeah and I would say that in some ways it's almost explicit in\n",
    "terms of how we design these systems um a lot of times like if you look at uh\n",
    "for example open ai's job listings they're actually hiring specifically for\n",
    "experts in different domains to essentially create the next batch of supervised\n",
    "data to train or instruction tune their models on uh for example they hire\n",
    "biologists or they hire people with legal expertise to generate this data and um\n",
    "you can think of this essentially as a human steered or human driven version of\n",
    "this active sampling process right because it's essentially they know that the\n",
    "model uh tends to get high perplexity or they don't it doesn't perform as well\n",
    "on this domain of tasks it doesn't get as high of an LSAT score as it could and\n",
    "so you can essentially you know it's it's beyond an algorithm at this point\n",
    "right it's kind of the super algorithm where you have the system designers now\n",
    "also being part of the data collection process and in a way um supervised\n",
    "learning is really just sort of one point in a continual learning process where\n",
    "you know classically we just looked at one step of this which is here's a batch\n",
    "of data train on that but really um building machine Learning Systems especially\n",
    "nowadays everything's in production these are all live systems you have have to\n",
    "keep it up to date you have to keep it continually generalizing to new knowledge\n",
    "um like Chachi PT or Claude or Gemini and so really it's sort of this pattern\n",
    "over and over again in sequence where you collect a batch of data train your\n",
    "model on that collect the next batch of data you know continue training your\n",
    "model on that um and really you want to be selective about what the next batch\n",
    "of data is because obviously if you just retrain it on the previous batch of\n",
    "data um it's going to overfit to that data uh Beyond a few epochs or uh it's not\n",
    "going to you know get as much novel information from it just because it's\n",
    "already trained on it so you do want to selectively actively collect the data\n",
    "and so I think we kind of almost explicitly already do this at a systems level\n",
    "um and I think the next Frontier is really just having systems that self-improve\n",
    "in this way where they can start to guide more of their own active data\n",
    "collection I love this way of thinking about it you know like gbt for is a mtic\n",
    "intelligence it's not just like you know a bunch of weights on on a on a on a\n",
    "server somewhere and so you could argue you know there this concept called\n",
    "graduate student descent which is what happens in in Academia or even as you\n",
    "just articulated with open AI it's a little bit like an epic Mechanical Turk\n",
    "right where you know um they are monitoring the logs they know when things go go\n",
    "badly and then they lean into it in the same way you are they they go in higher\n",
    "experts and they kind of like add more and more data in all of the holes and\n",
    "eventually there are no more pockets of like abject failure it just it just\n",
    "appears to work really well for everyone and people start to say that it's you\n",
    "know generally intelligent so yeah so there's there interesting systems if you\n",
    "of of of intelligence yeah it kind of starts to mimic just the scientific\n",
    "process in a way uh where we're sort of we we're putting a lot of Hope in the\n",
    "model to basically be able to distill uh information from sort of the net new\n",
    "batch of data that we collect um you know that we know the model currently\n",
    "doesn't explain well and we we we put a lot of faith in gradient descent in\n",
    "order to basically be able to come up with updates to the weights that better\n",
    "explain that data so we're kind of we're kind of already treating this system as\n",
    "almost like an automated um scientist or an automated version of this like\n",
    "continual process of creating theories and explanations about the world um but\n",
    "of course you know um humans are still much better at language models at doing\n",
    "this uh or large models at doing this so I do think there clearly seems like a\n",
    "huge gap in terms of what we still of work that needs to be done in order to\n",
    "build systems that can actually build much more robust theories uh based on like\n",
    "net new data and even seeking that out as humans do interesting and and\n",
    "certainly you know in this broader mtic intelligence we are still the sources of\n",
    "a gency but um we we were just sort of talking a minute ago about there being\n",
    "two types of AI you know that there's there's an AI where we are the generating\n",
    "sources of agency but there might potentially be another AI in the future where\n",
    "that that is the generating source of agency yeah I I so I think that um this\n",
    "kind of ties into my my the framework I personally use to think about open-ended\n",
    "systems as well uh where I think that you know at a high level you can you can\n",
    "study AI sort of in silicone you can study it in systems that you control that\n",
    "you design and that you try to like have the AI model self-improve within and so\n",
    "you can try to build uh systems that self-improve within silico and that's going\n",
    "to lead to potentially some issues around like the grounding problem where\n",
    "essentially it starts to the auto the autoc curricular exploratory process\n",
    "starts to Veer into Parts pockets of the design space that are not relevant to\n",
    "tasks you care about um and so this kind of the danger of like generating\n",
    "open-ended systems in silico and I think it's very similar to potential dangers\n",
    "of generating AGI in silico um and I think the alternative is really just what\n",
    "are existing intelligent systems and how do we actually amplify the efficiency\n",
    "the efficacy of those systems the intelligence within those systems and so you\n",
    "can kind of think of like sort of the entire Enterprise of AI research as do we\n",
    "want to generate like AI or intelligence from scratch or do we want to build\n",
    "tools uh you know motivated or inspired by human intelligence and other\n",
    "intelligence systems and use that to further amplify existing intelligence like\n",
    "human creativity human intelligence could you argue because if intelligence is a\n",
    "Divergent search process you might be tempted to think that well if we had loads\n",
    "of tools to help us share the models and help other people discover the models\n",
    "that I've created and that will help us generally be more intelligent but could\n",
    "you make the counter argument that I'm actually sequestering agency or stealing\n",
    "agency from other people because rather than thinking for themselves and\n",
    "discovering novel models they're just going to use my model yeah I mean I think\n",
    "that in the best case scenario you're Building Systems that essentially you know\n",
    "not not you know to to think about how you know as existing systems nowadays can\n",
    "build on the shoulders of foundation models you really want the to build models\n",
    "where even humans can stand on their shoulders where the humans can basically\n",
    "leverage the existing expertise or automative capabilities of those models to\n",
    "then like move further beyond what they're naturally capable of doing and really\n",
    "that pushes the frontier of the knowledge that we can create as a civilization\n",
    "and so you're already starting to see this where there's some recent studies\n",
    "that show for example like Junior software Engineers that use systems like um\n",
    "chaty BT to help them with coding at work they actually now are starting to\n",
    "match the performance of more senior Engineers uh because it sort of levels the\n",
    "playing field but that also translates into just like uh net more productivity\n",
    "per software engineer and so um I think that it's more just unlocking sort of uh\n",
    "existing bottleneck and how productive each individual can be and really just\n",
    "means that each individual can create a lot more value can discover a lot more\n",
    "knowledge um than before okay but I mean do do you think that it creates a\n",
    "tendency towards boilerplate though so we're more we're more efficient at doing\n",
    "things that exist but you know like on on the frontier we might have a Slowdown\n",
    "there's definitely the danger that it can lock you in to certain patterns right\n",
    "so basically if chpt always returns a certain boiler plate that might have an\n",
    "anti- pattern in it um if that stays around it could self- amplify and then\n",
    "future generations of programmers might just adopt that by default because it's\n",
    "what's already generated by autocomplete so I think that that's also another\n",
    "really interesting realm of questions which is basically how do you um how do\n",
    "you avoid these kinds of uh these local Optima when you start to train a model\n",
    "on its own outputs and I think again like sort of the solution will start to\n",
    "look like some form of novelty search or exploration makes sense okay um what do\n",
    "you guys think about like um you know acade Academia versus industry and um some\n",
    "say there's a bit of a brain drain from Academia totally yeah I think there's\n",
    "like a very very clear trade-off between the two in the S they both have like\n",
    "fantastic things going for them and I guess the trade-off being you know\n",
    "academic freedom and Academia and be able to like individually pursue ideas like\n",
    "purely for curiosity sake and um you know that's something I've really loved\n",
    "about Academia but I guess you know I guess the general Trend and and machine\n",
    "learning research at the moment is kind of towards like larger scale projects\n",
    "especially you know a lot of the properties that we might want to see kind of\n",
    "only emerge when you expend a lot of compute and therefore you know a lot of\n",
    "interesting research can kind of maybe not only be done on an industry but it's\n",
    "a lot easier to do some kinds of research in industry and so I think this kind\n",
    "of leads this trade-off of do you want freedom or do you want to be on these\n",
    "like larger projects that are potentially more impactful and so yeah I've really\n",
    "struggled with that trade-off I think they they both have big pros and cons I\n",
    "don't know what you think mie yeah I I think that um industry is I I think like\n",
    "at a very like first word rough approximation would be to say that industry\n",
    "focuses much more on um exploitation and Academia is where you know in principle\n",
    "you should get a lot more exploration um but I I do think that currently uh both\n",
    "systems are kind of like entwined in the same sort of reward function at a high\n",
    "level where essentially um you know if if if you're if you care a lot about\n",
    "citations and a short-term greedy algorithm for maximizing C citations would be\n",
    "to focus your research efforts on uh sort of whatever topic is uh trendy or\n",
    "hyped at the current time and so like I think you see tons of people obviously\n",
    "working on language models partly because it really is a fascinating subject and\n",
    "it really is like the most powerful form of deep learning we have so I\n",
    "understand why everyone's working on it but I also think that um a lot of it is\n",
    "kind of you you do get this sort of Rich gets richer effect around different\n",
    "topics that people tend to gravitate towards and you lose a lot of the\n",
    "exploration that you should otherwise have um and it's partly because you know\n",
    "like both industry and Academia are at some level optimizing for a similar um\n",
    "sort of reputational status or citation count sort of metric um and so I think\n",
    "that's an issue but I also think that in some ways uh industry you could say has\n",
    "a additional benefit where I do think that from like a short-term point of view\n",
    "industry is better poised to make certain um higher impact research not just\n",
    "because of the resources available to industry but also partly because um sort\n",
    "of Industry uh you know rides or dies based on whether the actual research\n",
    "artifact you produce uh is useful and so I think that's like a very powerful\n",
    "reward function that is not necessarily true for Academia um and then sort of on\n",
    "the to take the counter position I think Academia obviously you know you have a\n",
    "lot more freedom to just explore ideas that don't need to be on that critical\n",
    "path for Value creation immediately and so gives you a lot more scope to\n",
    "potentially find like the next big thing and so I think really it's about like\n",
    "if you want to if you want to take the bet that you can you know play a part in\n",
    "disc the next big thing then and that's that's suited to your taste for research\n",
    "then Academia makes more sense uh but if you know um you want to you want to\n",
    "maximize the probability you'll have a higher impact in sort of like a near\n",
    "horizon line of work then industry is definitely I think a better bet Rich\n",
    "Sutton you know he he had this bitter lesson essay and he made the argument that\n",
    "it's just all computation and there are no shortcuts and you can even think of\n",
    "you know maybe we're not very intelligent um Evolution has just been running for\n",
    "a very very long time and we are the result of that so in in a sense do you\n",
    "think that we could make strides in intelligence you know just through Ingenuity\n",
    "or are we always going to need loads of computer power this definitely like\n",
    "makes me think of like the recent Trend that we've been seeing even in like kind\n",
    "of the reinforcement learning literature lately which is like these kind of\n",
    "large scale like mostly industry projects that are kind of they're even ditching\n",
    "the idea of doing like sequential decision Mak so you know you have all these Al\n",
    "that are like you know optimal planning and so forth but we're kind of seeing a\n",
    "trend towards you know even ditching that complexity of algorithm and just going\n",
    "straight to just copy what the human did and so kind of reducing the problem to\n",
    "you know essentially no real algorithmic um Innovation and more just like can\n",
    "you gather enough expert data and I think yeah I guess the reason why that trend\n",
    "is occurring is is I guess like you said there kind of been you know the B\n",
    "lesson kind of said that you know just being able to SC with more data and more\n",
    "compute as kind of the most important thing and a lot of the more complex\n",
    "algorithms especially around like reinforcement learning are actually like quite\n",
    "challenging to scale up especially like online reinforcement learning if you\n",
    "want to go out and like actually have an agent like actively collecting data in\n",
    "a bunch of different environments and updating itself online like that's so much\n",
    "like engineering infrastructure to set up and so I think there's this this trend\n",
    "towards just like the simplest algorithm possible which is like not even\n",
    "reinforcement learning not even planning just copy an expert but I think that\n",
    "that's like you kind of said um earlier with like this kind of like short-term\n",
    "exploitation I think this is you know it it kind of makes sense to exploit this\n",
    "now and push it as far as possible because you know it's very easy to just train\n",
    "a large Transformer and then gather as much data as possible and I think in\n",
    "areas like robotics we haven't really seen like how far can that go like can you\n",
    "actually get a generally useful robotics platform just by gathering more expert\n",
    "demonstrations and training a larger and larger Transformer and so I think it\n",
    "does kind of make sense that why like a lot of Industry projects are pursuing\n",
    "that because we don't really know you know will will that actually hit a\n",
    "bottleneck or or if you just gather enough data will that will that kind of be\n",
    "sufficient and I guess like you know you could argue that I think it's probably\n",
    "true that there must be a better algorithm out there that that can in principle\n",
    "do this in a more efficient way but I guess if it's just easier to just gather\n",
    "more data and just do imitation learning I can see that there's at least a\n",
    "business case for trying that um so I guess I'm on the the opinion of like you\n",
    "know there must be a more efficient way of getting to like a more intelligent\n",
    "system but it's not necessarily clear that just scaling like raw supervised\n",
    "learning or unsupervised learning like won't get you there and so it it does\n",
    "make sense to pursue that first but kind of what I hope and expect to see is\n",
    "that eventually pure imitation learning or pure unsupervised learning will kind\n",
    "of run out of steam and everything will Plateau and I think at that point you\n",
    "know then these like more complicated algorithms about Gathering more data\n",
    "reinforcement learning planning Etc will really come into their own and so I\n",
    "guess this again relates back to like the Academia industry trade-off like you\n",
    "know a lot of the products in Industry are just going to kind of be exploiting\n",
    "gathering data right now whereas maybe there's a lot of scope to do these kind\n",
    "of more expor exploratory projects where maybe that will get you to like the\n",
    "next Frontier a few years down the line um I don't know what you think about\n",
    "this yeah I definitely think that um yeah just like treating everything as just\n",
    "supervised learning it does tend to work because we have large data sets but um\n",
    "I think again like the challenge is just at some point we will run out of tokens\n",
    "we'll run out of data to train on um and so that's why these self-improving more\n",
    "self-exploratory systems will be more and more I think Paramount to like driving\n",
    "performance even further so if we want to sort of break Beyond sort of the token\n",
    "limit of like the data that's available now we actually need these systems to\n",
    "generate their own tokens their own synthetic data um and that's that's where\n",
    "like the self-play autoc curricular exploration types of algorithms will start\n",
    "to um become more and more prominent and obviously you need an environment in\n",
    "which to do that exploration and that's where the world model um line of\n",
    "research is going to be very powerful just because uh that allows you to really\n",
    "sort of milk all of the value within uh the existing previous data you have seen\n",
    "by creating these Role Models where you might be able to do like counterfactual\n",
    "trajectories and really learn much more um amplify the existing data you had\n",
    "yeah I mean I think one of one of the key things for me um is modeling Dynamics\n",
    "so um it's quite interesting actually with the human knowledge thing so even\n",
    "looking at the Innovations from from Deep Mind you know early versions of of\n",
    "alphao were bootstrapped with human knowledge and then there was the alpha zero\n",
    "so it was actually doing what we're were talking about it was actually\n",
    "discovering knowledge on its own and um in principle that's a great idea but of\n",
    "course like any restricted domain it's tractable but in in the real world it\n",
    "isn't and I'm not sure whether it makes sense to use the the computation and met\n",
    "you know information metaphor for the real world and humans and so on but but\n",
    "the basic idea is that we are all real agents the universe is a massive computer\n",
    "we're discovering all of this knowledge and then we're bootstrapping that into\n",
    "um a machine learning algorithm and then the question is well if you kind of\n",
    "just capture the thing now without the Dynamics that produced it um will the\n",
    "system be robust and could you still um you know kind of carry on as we were in\n",
    "the real world if if that makes sense so um but yeah the interesting thing with\n",
    "the work you've done is is that you are modeling agential systems and you are\n",
    "modeling Dynamics but could that be used for you know much more complex tasks\n",
    "like the real world like simulating much more complex systems in the real world\n",
    "exactly yeah I think that if you if you so I think that just purely imitation\n",
    "learning alone is not really going to get you there um but I think that if you\n",
    "can if you can uh imitate so one is sort of finding the set of tasks I think\n",
    "that uh if you find the set of tasks or reward functions that could be relevant\n",
    "then you can start to simulate things that are otherwise really hard to capture\n",
    "by just purely imitating historical trajectories so for example strategic\n",
    "adaptation type of behaviors are really hard because those are sort of an\n",
    "open-ended space of behaviors where if you basically have like a stock market\n",
    "for example that's a really good example where if you have a stock market that's\n",
    "a very open-ended system and like different Traders will have different\n",
    "strategies that are best responses to each other and then over time the set of\n",
    "strategies evolves over time in an open-ended way um you know trading strategies\n",
    "that worked 10 years ago probably won't work very well today because people have\n",
    "sort of um they they've sort of figured out uh those strategies and so they\n",
    "won't be very competitive and so um I don't see an IM an imitation learning\n",
    "system being able to sort of um generalize to that level of complexity just\n",
    "because by definition it's imitating previous uh trajectories and therefore\n",
    "strategies so I think you need some of like a um a more uh more interactive\n",
    "trial and error learning that allows for strategic adaptation and that requires\n",
    "some notion of a payoff or a reward and so you kind of need to have this this\n",
    "idea of um you you can't just purely I think learn uh a model of something like\n",
    "the stock market just based on previous data you really need to have more\n",
    "inductive biases around uh sort of you know what creates a payoff or what the\n",
    "actual reward function is for each of the Traders uh but that might be something\n",
    "that you could um you could learn over time but I but maybe not in the yeah so\n",
    "sry this is kind of like this not very coherent but I feel like uh you might\n",
    "need something that looks more like learning over a space of programs that\n",
    "starts to Encompass different kinds of uh tasks and then you can basically\n",
    "simulate those tasks to completion with agents that can essentially uh try to\n",
    "self-improve against other agents the stock market I think is a wonderful\n",
    "metaphor for what we're talking about and for for two reasons first of all from\n",
    "the grounding reason because you know like the the the mtic world is very\n",
    "ungrounded and that's why we develop as humans lots of weird shared delusions\n",
    "about things because it's actually like you know it can go in it can go in\n",
    "almost any direction and also the concept of alpha I think is really important\n",
    "because a trading strategy works really well today and then when other people\n",
    "learn about it it no longer provides an advantage because everyone else knows\n",
    "about it and I feel it's the same with language models so you know like GPT 4\n",
    "Pros was really novel and cool it it was great to you know have like a TED Talk\n",
    "speech when it came out and now it doesn't seem cool anymore because everyone's\n",
    "using it on LinkedIn so it's almost like that that we need to have this in like\n",
    "continuous creative evolving process producing new sources of Alpha and the\n",
    "Paradox is that if everyone has access to the same model it can't be a source of\n",
    "alpha by definition yeah I guess on that like topic because we kind of talked\n",
    "about like synthetic data earlier and you kind of said like you know one one\n",
    "mechanism towards getting like a kind of self-improving system that is able to\n",
    "kind of you know continue to improve is to kind of like fil so the synthetic\n",
    "data for example so we might kind of you know have the the new system and then\n",
    "we generate some more data and then we kind of have some like filtering\n",
    "mechanism to say that you know in the current stock market this is this is good\n",
    "data or what you know whatever system we're thinking about and and then we can\n",
    "kind of like use that to enable the model to improve um you know and adapt to\n",
    "the new system but something I've always like like thought about is like or I\n",
    "guess one is is it really trivial to be able to like filter that you know new\n",
    "synthetic data and then two it feels like if you're just relying on like fil\n",
    "filtering existing synthetic data like is isn't that inevita going to kind of\n",
    "plateau and so I guess event you know we talked about how you kind of said that\n",
    "you do actually actively not need to go out and get real more real data but I\n",
    "guess I'm kind of asking you do you think this idea of just like filtering\n",
    "synthetic data from a model is kind of sufficient to always be able to adapt and\n",
    "improve or is it always going to be a mixture of like more real data plus\n",
    "synthetic data filtering I think it's the letter just because um at some point\n",
    "you would expect that uh the synthetic data you do generate it'll start to sort\n",
    "of uh saturate like what's already in the model um just cuz the model is trained\n",
    "on a finite amount of information so at some point you're just going to start to\n",
    "see more and more um especially like the more likely trajectories or sequences\n",
    "of samples you'll start to see that uh more and more and so you're not really\n",
    "going to be very sample efficient in terms of searching for the synthetic data\n",
    "so can can you tell us about the results of the paper totally Yeah so basically\n",
    "we evaluate this um this algorithm on a bunch of like synthetic simulated\n",
    "domains kind of like robotics related tasks um and kind of yeah environments\n",
    "where there's like varying levels of complexity so you know you might have a\n",
    "robot pushing around a variable number of like objects or maybe you have\n",
    "different terrain that the robot might want to um um learn to kind of you know\n",
    "do Locomotion over and things like this um and so kind of you know the main\n",
    "comparison we make is like how well does Waker work relative to like naive\n",
    "domain randomization so how well does it work if you just like uniformly sample\n",
    "the space of environments versus if you do actively seek out the environments\n",
    "that have this like higher uncertainty um and so basically what we show is that\n",
    "you know if we do the Waker approach we still do like very well on average but\n",
    "we consistently do better in terms of robustness and so robustness by robustness\n",
    "I mean here that that we do better in terms of the worst environments um that\n",
    "the agent is evaluated under and so this kind of means you know if the agent is\n",
    "able to do well in the worst environments that it that it is evaluated under\n",
    "that kind of shows that it's able to do well across all environments because its\n",
    "worst performance is still good um so you kind of this this shows that we we\n",
    "achieve this like robustness property um which we talked about in terms of like\n",
    "Mini Max regret um but we EV we don't evaluate it in terms of like the true\n",
    "notion of Minimax regret because as as we talked about earlier actually\n",
    "evaluating regret exactly is difficult because that that that requires knowing\n",
    "the exact true Optimal Performance um which isn't something we can really know\n",
    "so instead we we just show that you know the agent performs well across all\n",
    "environments more so than if you just like naively sampled the environments\n",
    "uniformly and in terms of decomposing the performance across the spectrum of\n",
    "possible environments so like you know the ideal situation is that we have a\n",
    "very simple model which just generalizes so we happen to have found the golden\n",
    "Motif you know there's a spectrum of correlations almost all of them are Furious\n",
    "but we've just you know just by through some sheer magic we found the best Motif\n",
    "to work in all situations probably that's not quite true probably there are some\n",
    "good generalizing motifs and the model has also kind of like memorize the long\n",
    "tail and that there's some degree of like you know it works really well on on\n",
    "the test set but might not out of domain distribution do you have any like way\n",
    "of reasoning about what that is um so yeah I agree I guess there's like not\n",
    "necessarily it's not necessarily the case that by like focusing more on these\n",
    "like longtail examples that's necessarily the best way of of training the best\n",
    "model because like you said like maybe it happens to be the case that if the\n",
    "model is trained on some certain subset of the task like that will actually\n",
    "generalize better but but I think in practice that's not something we can really\n",
    "um really know how to you know like optimally select the the best kind of set of\n",
    "tasks that will generalize well and so we do focus more on on like you know\n",
    "these these kind of longtail tasks like the ones that we might see rarly and\n",
    "therefore have high uncertainty about um in terms of like the the out of\n",
    "distribution generalization so so we do also do some experiments like looking at\n",
    "how well does the model generalize out of distribution um and basically what we\n",
    "show is that if we train the model in this way and then we give it some more\n",
    "environments that hasn't seen at test time um if the environments are more\n",
    "complex then then we've seen sorry hasn't seen it training time basically like\n",
    "this model then generalizes better to outof distribution environments that are\n",
    "like more complex which is kind of what you'd expect cuz we've kind of bias\n",
    "something towards more complexity we're able to generalize better to out of\n",
    "distribution environments that have higher complexity and then guess the\n",
    "question is like do we care about out of distribution environments that have\n",
    "higher complexity like what about the out of distribution environments that have\n",
    "lower complexity um and I would argue that you know basically the lower en out\n",
    "dist distribution environments that have lower complexity like we would already\n",
    "expect that the model is able to do very well at so so there's not really much\n",
    "of a difference there because you know almost any reasonably trained model can\n",
    "handle the very simplest environment so what we really care about is can we\n",
    "generalize out of distribution to like higher complexity environments and so by\n",
    "biasing the suling towards the higher complexity environments we do show that\n",
    "we're able to generalize further out of distribution to even higher complexity\n",
    "environments okay but is there any way of knowing whether it's kind of like\n",
    "memorizing the high complexity instances or whether it's still learning abstract\n",
    "motifs and generalizing between them yeah that's a great question I think that's\n",
    "a really interesting question generally for ML as a field right now which is\n",
    "better evaluation benchmarks for uh generalization within different kinds of\n",
    "models um and like we we alluded to earlier there's kind of this uh issue of\n",
    "data leakage between training and test set which is um which is definitely an\n",
    "issue that is currently happening with large language models um it doesn't take\n",
    "away from the impressiveness of these models CU clearly there is a strong\n",
    "generalization aspect to their behavior but I do think that in terms of\n",
    "measuring performance on specific benchmarks um we really need to solve this\n",
    "problem how do we have these clean data sets uh that allow us to to truly test\n",
    "on inputs that the model hasn't seen at training um I think in the case of uh\n",
    "reinforcement learning um that's a bit more difficult just because usually we\n",
    "focus on a particular task domain and so there's always going to be some shared\n",
    "similarities within task but obviously uh we didn't do this in this paper but we\n",
    "could try things where we have more um more controlled uh settings where we you\n",
    "know change one aspect of the environment and really uh see if it's learning\n",
    "specific causal relationships between um things that have to be accomplished in\n",
    "that task uh but we didn't do that um that actually think would be a really\n",
    "interesting idea for uh a new evaluation environment for RL yeah I mean the\n",
    "benchmarks thing is just a huge challenge in in machine learning um in general\n",
    "but just just to kind of round off off the interview I mean M you were talking\n",
    "about you're doing some work with um Ed grinstead and he's an amazing guy I'm\n",
    "get getting Ed back on and um you said that um been looking into this kind of\n",
    "the interface between humans and and machine learning can you tell me about that\n",
    "yeah so just to not say too much about it because um it's related to current\n",
    "work that's happening at Deep Mind um is just that you know I think from\n",
    "personally from a high level point of view I'm very interested you know talking\n",
    "about this divide sort of this fork in the road in terms of what's the path to\n",
    "open studying open-endedness studying it in silico or studying it in situ in the\n",
    "setting of an actual open-ended system like a user um app interaction or you\n",
    "know the interaction between a user and a piece of software on the web uh or\n",
    "potentially with many other users there are such Rich um existing systems online\n",
    "that are already open-ended because they amplify or connect the creativity and\n",
    "knowledge of humans um to create more knowledge and more creative artifacts and\n",
    "so I think what's really uh interesting in my mind now is sort of studying uh\n",
    "systems or algorithms that allow us to better steer the creativity of humans uh\n",
    "as they are uh mediated by software um and basically allow us to essentially\n",
    "amplify existing intelligent or creative systems that are open-ended so amplify\n",
    "existing open open-endedness rather than try to build it from scratch amazing\n",
    "guys it's been an honor to have you on MLS thank you so much thanks so much yeah\n",
    "great cool yeah we're done\n",
    "\"\"\"\n",
    "\n",
    "chunks = splitter.chunks(transcript)\n",
    "for i, chunk in enumerate(chunks):\n",
    "  print(f\"Chunk {i+1}:\\n{chunk}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Messages: [{\"role\":\"system\",\"content\":\"you are a helpful assistant\\nyour name is Winston\\nyou are a helpful assistant\\nyour name is Winston\\nYou are a helpful assistant\\nYour name is Winston\\nYou are a helpful assistant\\nYour name is Winston\"},{\"role\":\"user\",\"content\":\"You are a helpful assistant\\nYour name is Winston\"}]\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "prompt = \"\"\"\n",
    "you are a helpful assistant\n",
    "your name is Winston\n",
    "you are a helpful assistant\n",
    "your name is Winston\n",
    "You are a helpful assistant\n",
    "Your name is Winston\n",
    "You are a helpful assistant\n",
    "Your name is Winston\n",
    "\"\"\".strip()\n",
    "\n",
    "prompt_input = \"\"\"\n",
    "You are a helpful assistant\n",
    "Your name is Winston\n",
    "\"\"\".strip()\n",
    "\n",
    "messages = [\n",
    "  {\"role\": \"system\", \"content\": prompt},\n",
    "  {\"role\": \"user\", \"content\": prompt_input},\n",
    "]\n",
    "print(\n",
    "  f\"Messages: {json.dumps(messages, separators=(',', ':'))}\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
